Swire-CC Innovation Product Forecasting - Modelling

Authors: ¶

Michael Mendoza
Neil Samuel Pulukuri
Gnana Chaithanya Rawali Male
Abinav Yadamani

Table of Contents

1. Introduction¶

2. Project Goal & Analytics Approach

3. Data Description & Missing Value Analysis

4. Q1 Modelling¶

5. Q2 Modelling¶

6. Q3 Modelling¶

7. Q6 Modelling¶

8. Q4 Modelling¶

9. Q5 Modelling¶

10. Q7 Modelling¶

11. Team Contribution¶

image.png

Business Problem Introduction ¶

Innovative Product Business ¶

Launching innovative products is crucial for beverage companies to drive revenue growth, diversify the product portfolio, attract broader consumer categories, and strengthen brand visibility and loyalty. These products will also cater to recent trends over prioritizing health and sustainability. Innovative products also boosts profit as these are offered at premium pricing. Overall, product innovation is critical for staying relevant in a dynamic market, ultimately boosting revenue and sustaining long-term growth. Swire Coca-Cola is constantly introducing innovation products into the market.

Business Problem ¶

Swire Coca-Cola faces a business challenge in optimizing production planning and inventory management for its innovative beverage products segment. These products do not have a exact match of the historic sales and Swire in the past have faced situations on both the extremes, i.e., leaving money on the table due to lower estimations of exact demand and then over production leading to product distress. The company aims to accurately forecast the weekly demand, identify the most profitable region and time periods for launching these products using historical sales and market data of similar products of Swire and it's competitiors, along with customer demographic information to strike a balance between both out-of-stock and overproduction situations.

Project Goal ¶

The goals of this project are as follows:

  1. Match and map the 7 innovative products provided by Swire with the historical data.
  2. Depending on the question, identify the best region, time period and the forecasted demand.
  3. Use time series forecasting methods such as ARIMA, SARIMA and Prophet to forecast demand.
  4. Split the data into train-test for model evaluation to justify the models used and results obtained.
  5. Measure the model performance using error metrics like MAE, MAPE & RMSE by comparing the estimated demand with the actual values for the train and test sets.
  6. Use the model to forecast demand, time period or region.

Modelling Notebook Purpose¶

The purpose of this notebook to develop multiple models to accurately forecast demand on a weekly basis for the given 7 innovative products. These models will be used by Swire to solve the problem with planning production and inventory for these products with the outcome being Swire takes maximum advantage to boost revenue from these innovative products with minimal overproduction.

Modelling Approach & Assumptions

Approach

The products that we are dealing with are innovative products, meaning, these products do not have an exact replica of sales in the past. Without historical data, building a model will pose a major challenge and blocker to this project.

To overcome this, the approach that we will be taking is to match and map the product attributes (Flavor/ Category/ Caloric Segment/ Package/ Manufacturer/ Brand) by use them to filter the data and then develop forecasting model. It would not be possible to apply all the 6 filters at a time, hence, we will be using 2-3 sets of filters for modelling in order to capture all the nuances in the data. Once we have a forecast with the initial set of filters, we will account for the remaining set of filters.

Filtering Strategy

Why Flavor?

We will be filtering the dataset by flavor of the innovative product first because flavor is a primary determinant of consumer appeal and purchasing behavior in the beverage industry. Analyzing the performance of the specific flavor variant allows for a more targeted and accurate assessment of its market potential and forecasting demand. This targeted approach enables us to assess the demand and performance of the flavor independently of other variants or products within the same brand or category. Additionally, flavor is the a critical product attribute that might help capture seasonality in the dataset. This granularity in analysis improves the precision of our forecasting models

After flavor, the next important filters are brand/manufacturer, calory segment and category.
Why filtering by Category is important?
Analyzing the category, whether "Regular" or "Diet/Light," is crucial in understanding consumer purchase behavior due to the following reasons: - Health Consciousness & Lifestyle preferences: s. Consumers opting for "Diet/Light" products are likely motivated by health considerations, such as calorie control or sugar reduction, influencing their purchase decisions. This is critical especially in diabetic patients or in regions where we identified more senior population proportion. - Impact on Demand: Consumer perception of a product's category can significantly impact its demand dynamics. Changes in health trends or dietary recommendations may influence the popularity of "Diet/Light" products, while factors like taste innovations or marketing campaigns may drive demand for "Regular" offerings. Later in the modelling we have observed strong correlation between diet product sales and time of the year. Diet sales is the highest towards the beginining of the year and then have gradually declined. This might be becasue people prefer healthier and low sugar options at the start of the year through new year resolutions.

Why filtering by Caloric Segment is important?

Similar to Category, again consumer preferences led by taste and health are predominat in determining if they will purchase a specific water type or a energy drink.

Why filtering by manufacturer or Brand is important?

Brand Loyalty: Consumers often exhibit brand loyalty, preferring products from specific brands due to factors such as trust, reputation, and perceived quality. Analyzing data specific to a brand therefore will be a more reliable forecast. </p>

Assumptions

1. Region Mapping

In the scope of the dataset, we have assumed regions and below is the mapping for the same:

  • North West States: Washington (WA), Oregon (OR)

  • North States: Idaho (ID), Wyoming (WY)

  • South States: Arizona (AZ), New Mexico (NM)

  • West States: Nevada (NV), California (CA)

  • Central States: Utah (UT), Colorado (CO)

  • East States: Nebraska (NE), South Dakota (SD), Kansas (KS)


2. 13 weeks of the year - consecutive or best 13 of the year?

The questions provided by Swire does not specify if the 13 weeks needs to be consecutive or can be the top 13 weeks of the year in any random fashion. For the sake of this project, we will consider the best set of 13 consecutive weeks. The reason for this is simple: In the inital couple of meetings with the Swire stakeholders, we were notified that based on how well the product performs in the following weeks post launch, retail stores shall be restocked. Morever, it might be a feasible option for Swire to launch a new product only for 1 week and then again restock several months later, which is not reasonable at all. Keeping all these factors in mind, in the scope of this project, when the questions asks which 13 weeks of the year, we will assume which consecutive 13 weeks of the year.

3. Empahsis on Unit Sales over Dollar Sales?

The business problem revolves around forecasting sales and not predicting revenue. Therefore, we will prioritize unit sales as it provide better insights into how many units Swire will need to be ready before launching a product, and the necessary restocking of the retail stores. This will also eliminate multicollinearity as identified in the EDA phase. </p>

4. Potential outliers?

In the notebook, you will find outliers detected for few products. However, since we were informed that the data has gonr through thorough cleaning already by the stakeholders before handing it over to us, we will not remove any outliers.

Data Preparation¶

Data Preparation is the process of transforming raw data into a format that is suitable for analysis and modeling. This includes steps such as variable transformations, feature engineering and handling NAs. In the context of this project, data preparation will vary depending on the question. Some examples of data preparation are as follows: 1. For instance, questions with region involved will need a column to be added for marking regions based on the states that we mapped from the EDA. 2. Another instance of data preparation is that for the Prophet model, the inputs to the model needs to be formatted as 'ds' and 'y' columns only where ds is the date column and y is the actual values of the target variable. In our case we will focus on unit sales only and neglect dollar sales since from the EDA we observed very strong multi-collinearity between these two variables. 3. In the EDA phase, missing values have been identified for few particular Brands were missing. After thorough analysis on these brands, we have already imputed them. In addition to the above pointed out pre-model work, question and model specific data preparation will be listed wherever necessary throughout the notebook.

Forecasting Models Considered¶

  1. Prophet:

    • Prophet was chosen for its ability to handle time series data with multiple seasonalities and holidays, which are common in sales forecasting.
    • In this project, Prophet is suitable for capturing complex sales patterns influenced by the innovative product's attributes along with other factors like seasonal trends, promotions, and holidays specific to the beverage industry.
    • This makes it a suitable choice for forecasting sales for our use case.
  2. ARIMA (AutoRegressive Integrated Moving Average):

    • ARIMA was selected due to its capability to model and forecast stationary time series data with autoregressive and moving average components.
    • In the context of this project, ARIMA is useful for capturing short term sales trends and fluctuations, providing insights into immediate sales variations.
    • This can particularly be helpful for Q2 where we are asked to forecast for only 2 weeks before and after Easter.
  3. SARIMA (Seasonal AutoRegressive Integrated Moving Average):

    • SARIMA was used to extend the capabilities of ARIMA by incorporating seasonality into the forecasting model.
    • Given the seasonal nature of sales for Swire and in general for the beverage industry, SARIMA allows for capturing and forecasting seasonal patterns and trends, enhancing the accuracy of sales predictions.
    • This helps identify and forecast recurring patterns in sales data, making it a very handy option given the uncertainties around matching historic products.
  4. Exponential Smoothing:

    • This model offers flexibility as it is a high performing generic option for forecasting with applications ranging from short term fluctuations to long term trends and seasonal patterns.
    • Moreover, these models are advantageous for their simplicity and ease of implementation.

In summary, Prophet, ARIMA, SARIMA, and Exponential Smoothing models were chosen and will be used depending on the question and the dataset that we need to deal with. Having multiple models forecast (wherever feasible) on the same data is what we will be doing for most of the questions. Best model will be considered based on its performance on the test set and more importantly having lower error metrics.

In [2]:
# importing libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

# Setting the warnings to be ignored
warnings.filterwarnings('ignore')
In [6]:
#market_data = pd.read_csv(r"/content/drive/MyDrive/FACT_MARKET_DEMAND_Ref.csv")
import pandas as pd
market_demand = pd.read_csv("FACT_MARKET_DEMAND-001.csv")

Data Preparation: Let's impute with the NA's in Caloric Segment with Regular type as that is the mode. The explanation for this as already been covered in the EDA were found that all the missing values were from three Brands only and all the products from these brands were Null for the caloric segment. Upon digging deeper on the other aspects of the brand, it was concluded that Regular which is also a majority in the dataset is the right imputation strategy.

In [7]:
market_demand['CALORIC_SEGMENT'].fillna(market_demand['CALORIC_SEGMENT'].mode()[0], inplace=True)
print("\nMissing value imputation check:")
print(market_demand.isnull().sum().sum())
Missing value imputation check:
0

Q1

Item Description: Diet Smash Plum 11Small 4One

  • Caloric Segment: Diet
  • Market Category: SSD
  • Manufacturer: Swire-CC
  • Brand: Diet Smash
  • Package Type: 11Small 4One
  • Flavor: ‘Plum’

Which 13 weeks of the year would this product perform best in the market? What is the forecasted demand, in weeks, for those 13 weeks?

Begining filtering by Flavor:

In [8]:
# filtering for Plum
plum = market_demand[(market_demand['ITEM'].str.contains('PLUM', case=False, regex=True))]

Adding a week number column:

In [9]:
# Assuming 'DATE' column is in datetime format
plum['DATE'] = pd.to_datetime(plum['DATE'])

# Find the earliest date
earliest_date = plum['DATE'].min()

# Calculate the week numbers from the earliest date
plum['WEEK_NUMBER'] = (plum['DATE'] - earliest_date).dt.days // 7 + 1
In [10]:
# Aggregate data by week number
weekly_sales = plum.groupby('WEEK_NUMBER')['UNIT_SALES'].sum()

# Plot time series
plt.figure(figsize=(18, 6))
weekly_sales.plot(marker='o', color='blue', linestyle='-')
plt.title('Weekly Unit Sales')
plt.xlabel('Week Number')
plt.ylabel('Total Unit Sales')
plt.grid(True)
plt.show()

Plum appears to have good sales during the beginining and end weeks of the year. Lets also filter by Swire now and look at a holistic chart for sales all the years for Plum + Swire:

In [11]:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.signal import argrelextrema

# Assuming 'plum' dataframe contains 'WEEK_NUMBER' and 'UNIT_SALES' columns

# Aggregate data by week number
weekly_sales = plum[plum['MANUFACTURER'] == 'SWIRE-CC'].groupby('WEEK_NUMBER')['UNIT_SALES'].sum()

# Find local maxima (peaks)
peaks_index = argrelextrema(weekly_sales.values, comparator=lambda x, y: x > y, order=5)[0]
peaks = weekly_sales.iloc[peaks_index]

# Plot time series with peaks
plt.figure(figsize=(17, 8))
weekly_sales.plot(marker='o', color='blue', linestyle='-', label='Total Unit Sales')
peaks.plot(marker='o', color='red', linestyle='', markersize=8, label='Peaks')

# Annotate peaks with week number
for week, sales in peaks.items():
    plt.annotate(f'Week {week}\n({sales} units)', xy=(week, sales), xytext=(-20, 20), textcoords='offset points',
                 arrowprops=dict(arrowstyle='->', color='black'))



plt.title('Weekly Unit Sales with Peaks')
plt.xlabel('Week Number')
plt.ylabel('Total Unit Sales')
plt.grid(True)
plt.legend()
plt.show()

Looks like peaks are approximately 13 week gap and Swire might be restocking the shelves.

Lets group by weeks of the year to gauge sales with time of the year and dive deeper by computing sales by aggregating sales data every 13 weeks:

In [27]:
plum['13_WEEK_INTERVAL'] = (plum['WEEK_NUMBER'] - 1) // 13 + 1
In [ ]:
import matplotlib.pyplot as plt

# Assuming rolling_sales_df contains the DataFrame with 'SALES_13W_SUM' and '13_WEEK_INTERVAL' columns

# Extract data for plotting
sales_sum = rolling_sales_df['SALES_13W_SUM']
interval = rolling_sales_df['13_WEEK_INTERVAL']

# Plot the rolling sum
plt.figure(figsize=(18, 6))
plt.plot(interval, sales_sum, marker='o', linestyle='-')
plt.xlabel('13-Week Interval')
plt.ylabel('Sales Sum')
plt.title('Rolling Sum of Sales over 13-Week Intervals')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.grid(True)

# Find the index of the maximum value
max_index = sales_sum.idxmax()
max_value = sales_sum[max_index]
max_interval = interval[max_index]

# Annotate the peak with a red marker and the corresponding 13-week interval
plt.annotate(f'Max: {max_value} ({max_interval})',
             xy=(max_interval, max_value),
             xytext=(80, -20),
             textcoords='offset points',
             arrowprops=dict(facecolor='red', arrowstyle='->'),
             fontsize=10,
             color='red',
             ha='center')

plt.tight_layout()
plt.show()

image.png

In [ ]:
import matplotlib.pyplot as plt

# Assuming rolling_sales_df contains the DataFrame with 'SALES_13W_SUM' and '13_WEEK_INTERVAL' columns

# Extract data for plotting
sales_sum = rolling_sales_df['SALES_13W_SUM']
interval = rolling_sales_df['13_WEEK_INTERVAL']

# Plot the rolling sum
plt.figure(figsize=(18, 6))
plt.plot(interval, sales_sum, marker='o', linestyle='-')
plt.xlabel('13-Week Interval')
plt.ylabel('Sales Sum')
plt.title('Rolling Sum of Sales over 13-Week Intervals')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.grid(True)

# Find the index of the maximum value
max_index = sales_sum.idxmax()
max_value = sales_sum[max_index]
max_interval = interval[max_index]

# Annotate the peak with a red marker and the corresponding 13-week interval
plt.annotate(f'Max: {max_value} ({max_interval})',
             xy=(max_interval, max_value),
             xytext=(80, -20),
             textcoords='offset points',
             arrowprops=dict(facecolor='red', arrowstyle='->'),
             fontsize=10,
             color='red',
             ha='center')

plt.tight_layout()
plt.show()

from the plot above using Plum + Swire data, we see that week 22-34 of the year are the highest from the historic data.

Prophet Model

Data Preparation for Modelling

  1. Missing values already imputed.
  2. Necessary data has been extracted by filtering.
  3. Since, we will have many data points on a daily basis, lets aggregate data by DAY so that every day consitiutes to one single point in the plot.
  4. Prophet model will need the inputs to be formatted as 'ds' and 'y' columns only where ds is the date column and y is the actual values of the target variable.
  5. Model will be validated using train-test split by verifying performance of the model trained on the train set, on the test set.
In [112]:
swire_plum = plum[plum['MANUFACTURER'] == 'SWIRE-CC']
swire_plum= swire_plum.sort_values('DATE')

swire_plum_daily = swire_plum.groupby('DATE')['UNIT_SALES'].sum().reset_index()
swire_plum_daily.head(2)
Out[112]:
DATE UNIT_SALES
0 2020-12-05 1961.0
1 2020-12-12 1625.0

Model Validation¶

In [113]:
from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly
import pandas as pd


# Convert DATE column to datetime if not already in datetime format
swire_plum_daily['DATE'] = pd.to_datetime(swire_plum_daily['DATE'])

# Sort the data by date
swire_plum_daily = swire_plum_daily.sort_values(by='DATE')

# Calculate the index for partitioning the data
partition_index = int(len(swire_plum_daily) * 0.8)

# Split the data into train and test sets
train_data = swire_plum_daily.iloc[:partition_index]
test_data = swire_plum_daily.iloc[partition_index:]

# Rename columns as per Prophet's requirements
train_data = train_data.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
test_data = test_data.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})

# Print the lengths of train and test dataframes
print("Length of Train Data:", len(train_data))
print("Length of Test Data:", len(test_data))

# Print the start and end dates of train and test dataframes
print("Start Date of Train Data:", train_data['ds'].min())
print("End Date of Train Data:", train_data['ds'].max())
print("Start Date of Test Data:", test_data['ds'].min())
print("End Date of Test Data:", test_data['ds'].max())
Length of Train Data: 119
Length of Test Data: 30
Start Date of Train Data: 2020-12-05 00:00:00
End Date of Train Data: 2023-03-11 00:00:00
Start Date of Test Data: 2023-03-18 00:00:00
End Date of Test Data: 2023-10-28 00:00:00
In [ ]:
# Initialize and fit the Prophet model
p_model = Prophet()
p_model.fit(train_data)

# Create future dates for forecasting
future = p_model.make_future_dataframe(periods=34, freq='W')

# Make predictions
forecast = p_model.predict(future)

# Plot the forecast
plot_plotly(p_model, forecast)

image.png

In [ ]:
plot_components_plotly(p_model, forecast)

image.png

we observe that while the sales of plum for swire dipped in 2022, it bounces back in 2023 changing the trend upwards. The yearly chart above shows peaks during March and August.

Lets plot a chart that shows how the model performed on the train and test sets:

In [115]:
pred = forecast[['ds','yhat']]
In [116]:
import matplotlib.pyplot as plt

# Plot test data
plt.figure(figsize=(12, 6))
plt.plot(train_data['ds'], train_data['y'], label='Actual Train Set', color='blue')

plt.plot(test_data['ds'], test_data['y'], label = 'Actual Test Set',color='lightblue')

# Plot forecast data
plt.plot(forecast['ds'], forecast['yhat'], label='Forecast', color='red')


# Add labels and title
plt.xlabel('Date')
plt.ylabel('Sales')
plt.title('Actual vs Forecast')
plt.legend()

# Rotate x-axis labels for better readability
plt.xticks(rotation=45)

# Show plot
plt.show()

Model is unable to capture sharp spikes in the test set, but does a very fair job in capturing the trend and average revenue sales on a weekly basis. This is the expectation in fact. If the model is able to capture and respond to spikes, then that will be a case of overfitting and any outliers can do more harm than good for Swire.

In [131]:
# Function to calculate the scores

from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt
import numpy as np
def get_scores(actual, predicted):
# Calculate errors
    mae = mean_absolute_error(actual, predicted)
    mse = mean_squared_error(actual, predicted)
    rmse = sqrt(mse)
    #percentage_diff = (np.abs(actual - predicted).sum()) /( np.abs(actual).mean())
    
    # Combine actual and predicted values into one DataFrame
    mape_df = pd.DataFrame({'actual': actual.values, 'predicted': predicted.values})

# Calculate absolute percentage error
    mape_df['absolute_percentage_error'] = np.abs((mape_df['actual'] - mape_df['predicted']) / mape_df['actual']) * 100

# Calculate MAPE
    mape = mape_df['absolute_percentage_error'].mean()
    
# Calculate MAPE
    #mape = percentage_diff.mean()
# Calculate "Accuracy" Percentage
    accuracy_percentage = 100 - mape
# Print metrics
#     print(f"MAE: {mae}")
#     print(f"MSE: {mse}")
#     print(f"RMSE: {rmse}")
#     print(f"MAPE: {mape}%")
#     print(f"Direct 'Accuracy' Percentage: {accuracy_percentage}%")
    return pd.Series(data={'mae':mae, 'mse':mse, 'rmse':rmse, 'mape':mape, 'direct_accuracy':accuracy_percentage}, index=['mae', 'mse', 'rmse','mape','direct_accuracy'])

Error Metrics¶

In [118]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt
import numpy as np
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt

# Assuming mulberry_diet_sparkling_validation is your DataFrame
swire_plum_sparkling_validation = swire_plum_daily.copy()

# Convert Date to datetime variable
swire_plum_sparkling_validation['DATE'] = pd.to_datetime(swire_plum_sparkling_validation['DATE'])

# Group by date to ensure only one record for each date
swire_plum_sparkling_validation = swire_plum_sparkling_validation.groupby('DATE')['UNIT_SALES'].sum().reset_index()

# Sort the data by date
swire_plum_sparkling_validation = swire_plum_sparkling_validation.sort_values(by='DATE')

# Calculate the index for partitioning the data
partition_index = int(len(swire_plum_sparkling_validation) * 0.8)

# Split the data into train and test sets
train_data = swire_plum_sparkling_validation.iloc[:partition_index]
test_data = swire_plum_sparkling_validation.iloc[partition_index:]

# Renaming for Prophet requirements
train_data = train_data.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
test_data = test_data.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})

p_model = Prophet()
p_model.fit(train_data)

# Create future dates for forecasting
future = p_model.make_future_dataframe(periods=0, freq='W')

# Make predictions
forecast = p_model.predict(future)

# Extracting the date and unit sales from forecast
pred = forecast[['ds','yhat']]
22:00:01 - cmdstanpy - INFO - Chain [1] start processing
22:00:01 - cmdstanpy - INFO - Chain [1] done processing
In [45]:
get_scores(test_data['y'], pred['yhat'])
Out[45]:
mae                  162.088814
mse                43781.580074
rmse                 209.240484
mape                  13.020970
direct_accuracy       86.979030
dtype: float64

The model is performing well with mape 13% and accuracy 86%

Lets compute 13 week aggregate sales:

In [55]:
earliest_date = forecast['ds'].min()

# Step 2: Calculate week numbers
forecast['WEEK_NUMBER'] = ((forecast['ds'] - earliest_date).dt.days // 7) % 52
weekly_forecast_data = forecast.groupby('WEEK_NUMBER')['yhat'].sum().reset_index()
In [56]:
# Sort the DataFrame by 'WEEK_NUMBER' if it's not already sorted
weekly_forecast_data = weekly_forecast_data.sort_values(by='WEEK_NUMBER')

# Ensure that there are enough rows to compute the rolling sum
if len(weekly_forecast_data) < 13:
    print("Insufficient data to compute rolling sum.")
else:
    # Initialize an empty list to store the results
    rolling_sales_sum = []

    # Iterate over the DataFrame to calculate the rolling sum for each 13-week interval
    for i in range(len(weekly_forecast_data) - 12):
        # Calculate the sum of UNIT_SALES for the current 13-week interval
        sales_sum = weekly_forecast_data.loc[i:i+12, 'yhat'].sum()
        # Construct the 13-week interval string
        interval = f"{weekly_data.iloc[i]['WEEK_NUMBER']}-{weekly_data.iloc[i+12]['WEEK_NUMBER']}"
        # Append the result to the list
        rolling_sales_sum.append((sales_sum, interval))

    # Create a new DataFrame from the results
    rolling_sales_df = pd.DataFrame(rolling_sales_sum, columns=['SALES_13W_SUM', '13_WEEK_INTERVAL'])

    # Display the DataFrame with the rolling sum and 13-week intervals
# print(rolling_sales_df)
In [57]:
import matplotlib.pyplot as plt

# Assuming rolling_sales_df contains the DataFrame with 'SALES_13W_SUM' and '13_WEEK_INTERVAL' columns

# Extract data for plotting
sales_sum = rolling_sales_df['SALES_13W_SUM']
interval = rolling_sales_df['13_WEEK_INTERVAL']

# Plot the rolling sum
plt.figure(figsize=(18, 6))
plt.plot(interval, sales_sum, marker='o', linestyle='-')
plt.xlabel('13-Week Interval')
plt.ylabel('Sales Sum')
plt.title('Rolling Sum of Sales over 13-Week Intervals')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.grid(True)

# Find the index of the maximum value
max_index = sales_sum.idxmax()
max_value = round(sales_sum[max_index])
max_interval = interval[max_index]

# Annotate the peak with a red marker and the corresponding 13-week interval
plt.annotate(f'Max: {max_value} ({max_interval})',
             xy=(max_interval, max_value),
             xytext=(70, 20),
             textcoords='offset points',
             arrowprops=dict(facecolor='red', arrowstyle='->'),
             fontsize=10,
             color='red',
             ha='center')

plt.tight_layout()
plt.show()

While historic data revealed week 22-34 fetching the highest sales, forecasted data shows that in 2024 week 35-47 period would fetch maximum sales for plum flavor for Swire. We see that the maximum values of 13 week aggregates for both historic and forecast are nearly the same (73k).

PLUM + SWIRE + DIET

In [ ]:
plum_swire_diet = swire_plum[swire_plum['CALORIC_SEGMENT'] == 'DIET/LIGHT']
plum_swire_diet_weekly_sales = plum_swire_diet.groupby('DATE')['UNIT_SALES'].sum().reset_index()
plum_swire_diet_weekly_sales.head(2)
In [ ]:
from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly
import pandas as pd


# Convert DATE column to datetime if not already in datetime format
plum_swire_diet_weekly_sales['DATE'] = pd.to_datetime(plum_swire_diet_weekly_sales['DATE'])

# Sort the data by date
plum_swire_diet_weekly_sales = plum_swire_diet_weekly_sales.sort_values(by='DATE')

plum_swire_diet_model_data = plum_swire_diet_weekly_sales.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})

# Initialize and fit the Prophet model
p_model = Prophet()
p_model.fit(plum_swire_diet_model_data)

# Create future dates for forecasting
future = p_model.make_future_dataframe(periods=52, freq='W')

# Make predictions
forecast = p_model.predict(future)

# Plot the forecast
plot_plotly(p_model, forecast)

image.png

This is interesting, diet sales dipped in 2021 and from then on maintained a side ways to slighly negative trend.

In [127]:
earliest_date = forecast['ds'].min()

# Step 2: Calculate week numbers
forecast['WEEK_NUMBER'] = ((forecast['ds'] - earliest_date).dt.days // 7) % 52
weekly_forecast_data = forecast.groupby('WEEK_NUMBER')['yhat'].sum().reset_index()

# Step 4: Create a time series plot
plt.figure(figsize=(12, 4))
plt.plot(weekly_forecast_data['WEEK_NUMBER'], weekly_forecast_data['yhat'], marker='o', linestyle='-')
plt.xlabel('Week Number')
plt.ylabel('Unit Sales')
plt.title('Weekly Total Unit Sales')
plt.grid(True)
plt.show()

sales at the beginining of the year appear to be the highest. Lets aggregate on a 13 week basis to identify best consectuive 13 weeks:

In [128]:
# Sort the DataFrame by 'WEEK_NUMBER' if it's not already sorted
weekly_forecast_data = weekly_forecast_data.sort_values(by='WEEK_NUMBER')

# Ensure that there are enough rows to compute the rolling sum
if len(weekly_forecast_data) < 13:
    print("Insufficient data to compute rolling sum.")
else:
    # Initialize an empty list to store the results
    rolling_sales_sum = []

    # Iterate over the DataFrame to calculate the rolling sum for each 13-week interval
    for i in range(len(weekly_forecast_data) - 12):
        # Calculate the sum of UNIT_SALES for the current 13-week interval
        sales_sum = weekly_forecast_data.loc[i:i+12, 'yhat'].sum()
        # Construct the 13-week interval string
        interval = f"{weekly_data.iloc[i]['WEEK_NUMBER']}-{weekly_data.iloc[i+12]['WEEK_NUMBER']}"
        # Append the result to the list
        rolling_sales_sum.append((sales_sum, interval))

    # Create a new DataFrame from the results
    rolling_sales_df = pd.DataFrame(rolling_sales_sum, columns=['SALES_13W_SUM', '13_WEEK_INTERVAL'])

    # Display the DataFrame with the rolling sum and 13-week intervals
    #print(rolling_sales_df)
In [129]:
import matplotlib.pyplot as plt

# Assuming rolling_sales_df contains the DataFrame with 'SALES_13W_SUM' and '13_WEEK_INTERVAL' columns

# Extract data for plotting
sales_sum = rolling_sales_df['SALES_13W_SUM']
interval = rolling_sales_df['13_WEEK_INTERVAL']

# Plot the rolling sum
plt.figure(figsize=(18, 6))
plt.plot(interval, sales_sum, marker='o', linestyle='-')
plt.xlabel('13-Week Interval')
plt.ylabel('Sales Sum')
plt.title('Rolling Sum of Sales over 13-Week Intervals')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.grid(True)

# Find the index of the maximum value
max_index = sales_sum.idxmax()
max_value = round(sales_sum[max_index])
max_interval = interval[max_index]

# Annotate the peak with a red marker and the corresponding 13-week interval
plt.annotate(f'Max: {max_value} ({max_interval})',
             xy=(max_interval, max_value),
             xytext=(70, 20),
             textcoords='offset points',
             arrowprops=dict(facecolor='red', arrowstyle='->'),
             fontsize=10,
             color='red',
             ha='center')

plt.tight_layout()
plt.show()

It is clear from the plot above that week 0-12 is the clear winner after filtering for diet caloric segment on top of plum and swire. Might indicate on how people on a new year take resolution on cutting down on sugars and as year progresses go back to their habits.

Now lets filter the forecasted data to last 1 year and extract the first 13 weeks of the year.

In [130]:
# week wise sales for best 13 weeks
best_13_weeks_q1 = forecast[(forecast['WEEK_NUMBER'] <= 12) & (forecast['ds'] > '2023-02-25')]
best_13_weeks_q1.head(2)
Out[130]:
ds trend yhat_lower yhat_upper trend_lower trend_upper additive_terms additive_terms_lower additive_terms_upper yearly yearly_lower yearly_upper multiplicative_terms multiplicative_terms_lower multiplicative_terms_upper yhat WEEK_NUMBER
145 2023-12-03 752.538260 476.851987 886.220095 750.513944 754.733644 -61.95856 -61.95856 -61.95856 -61.95856 -61.95856 -61.95856 0.0 0.0 0.0 690.579701 0
146 2023-12-10 749.318607 435.234791 850.450446 746.771570 752.273543 -89.59627 -89.59627 -89.59627 -89.59627 -89.59627 -89.59627 0.0 0.0 0.0 659.722337 1
In [131]:
# Compute total sales for all 13 weeks
total_sales = best_13_weeks_q1['yhat'].sum()

# Compute lower and upper bounds for estimate
lower_bound = best_13_weeks_q1['yhat_lower'].sum()
upper_bound = best_13_weeks_q1['yhat_upper'].sum()

# Calculate confidence interval
confidence_interval = ((upper_bound - lower_bound) / total_sales) * 100

# Display the results
print("Total Sales for 13 weeks:", total_sales)
print("Lower Bound Estimate:", lower_bound)
print("Upper Bound Estimate:", upper_bound)
print("Confidence Interval (%):", confidence_interval)
Total Sales for 13 weeks: 9057.187762419222
Lower Bound Estimate: 6291.951059331268
Upper Bound Estimate: 11799.518942786954
Confidence Interval (%): 60.80880763351411

Therefore, the best 13 weeks are 0-12 weeks with a total sales forecast of 9058 units. The above lower and upper bound represent the lower and upper limits of the estimated sales range derived from the Prophet model.

CI % suggests that there is a 60.81% level of confidence that the true sales value falls within the range defined by the lower and upper bounds. Lets visualize through a plot:

In [132]:
import matplotlib.pyplot as plt

# Extract week numbers and corresponding sales estimates
week_numbers = best_13_weeks_q1['WEEK_NUMBER']
sales_estimates = best_13_weeks_q1['yhat']
lower_bounds = best_13_weeks_q1['yhat_lower']
upper_bounds = best_13_weeks_q1['yhat_upper']

# Plot the time series graph with confidence interval
plt.figure(figsize=(12, 6))
plt.plot(week_numbers, sales_estimates, color='blue', label='Sales Estimate')
plt.fill_between(week_numbers, lower_bounds, upper_bounds, color='skyblue', alpha=0.4, label='Confidence Interval')
plt.xlabel('Week Number')
plt.ylabel('Total Sales')
plt.title('Week-wise Distribution of Total Sales with Confidence Interval')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

Accounting for remaining parameters (Category & Brand) - Historic Ratio Method

Considerations: For demand specific qs, neglecting package and market key as it doesn't add any value but simply increases complexity. Moreover, package type given in question 1 is completely new one in the entire dataset and cannot be extrapolated and even nearest neighbors methods won't be of use.

Having said that, lets compare sales statistics of category and brand for this innovative product with what we have after applying previous filters. This comparative analysis will offer insights into potential adjustments to sales proportions.

In [134]:
plum_swire_diet['CATEGORY'].value_counts(normalize=True)
Out[134]:
ENERGY             0.998592
SPARKLING WATER    0.001126
SSD                0.000282
Name: CATEGORY, dtype: float64

Since nearly all the data we obtained for forecasts comes for Energy category while the this plum innovative product is of SSD category, lets compare how the unit sales fared for both these categories in the entire dataset:

In the overall dataset, 69.2% of unit sales compared to Energy.

In [136]:
# Filter DataFrame for 'SSD' category
ssd_sales = market_demand[market_demand['CATEGORY'] == 'SSD']['UNIT_SALES'].sum()

# Filter DataFrame for 'ENERGY' category
energy_sales = market_demand[market_demand['CATEGORY'] == 'ENERGY']['UNIT_SALES'].sum()

print(" % of unit sales for SSD category compared to SSD + ENERGY category:", ssd_sales/(energy_sales+ssd_sales))
print(" % of unit sales for ENERGY category compared to SSD + ENERGY category:", energy_sales/(energy_sales+ssd_sales))
 % of unit sales for SSD category compared to SSD + ENERGY category: 0.6923682683459191
 % of unit sales for ENERGY category compared to SSD + ENERGY category: 0.30763173165408086
In [137]:
import pandas as pd

# Filter DataFrame for 'SSD' category
ssd_data = market_demand[market_demand['CATEGORY'] == 'SSD']['UNIT_SALES'].describe()

# Filter DataFrame for 'ENERGY' category
energy_data = market_demand[market_demand['CATEGORY'] == 'ENERGY']['UNIT_SALES'].describe()

# Concatenate the DataFrames horizontally
combined_data = pd.concat([ssd_data, energy_data], axis=1)
combined_data.columns = ['SSD', 'ENERGY']  # Rename columns for clarity
In [138]:
combined_data
Out[138]:
SSD ENERGY
count 1.291228e+07 5.932087e+06
mean 2.088707e+02 2.020072e+02
std 1.026061e+03 8.425682e+02
min 4.000000e-02 4.000000e-02
25% 1.500000e+01 7.000000e+00
50% 5.300000e+01 3.300000e+01
75% 1.560000e+02 1.370000e+02
max 9.677600e+04 7.165700e+04

From the above three reuslts, we see the correlation between SSD and ENERGY categories. About 68.5% of the sales between these two have come from SSD signifying that 70% of the records are of SSD. Suprisingly, even the total sales is the same proportion (69%). The above statistics also suggests that the mean sales is nearly same for both (around 200).

Through all this, we can safely assume SSD is more or less same as ENERGY and there is no need to change our sales estimates.

Now let's look at Brand:

In [139]:
plum_swire_diet['BRAND'].value_counts(normalize=True)
Out[139]:
VENOMOUS BLAST         0.998592
CUPADA ARID            0.001126
DIET MITE PURE ZERO    0.000282
Name: BRAND, dtype: float64

Nearly all the sales from the data used for forecasting is from VENOMOUS BLAST brand. Comparing sales between this brand and the brand we have in the question DIET SMASH:

In [141]:
# Filter DataFrame for 'SSD' category
vb_brand_sales = market_demand[market_demand['BRAND']=='VENOMOUS BLAST']['UNIT_SALES'].sum()

# Filter DataFrame for 'ENERGY' category
dietsmash_sales = market_demand[market_demand['BRAND']=='DIET SMASH']['UNIT_SALES'].sum()

print(" % of unit sales for DIET SMASH Brand compared to other two:", dietsmash_sales/(vb_brand_sales+dietsmash_sales))
print(" % of unit sales for VENOMOUS BLAST Brand compared to other two:", vb_brand_sales/(vb_brand_sales+dietsmash_sales))
 % of unit sales for DIET SMASH Brand compared to other two: 0.11897748583594121
 % of unit sales for VENOMOUS BLAST Brand compared to other two: 0.8810225141640589
In [142]:
# Filter DataFrame for 'SSD' category
vb_data =market_demand[market_demand['BRAND']=='VENOMOUS BLAST']['UNIT_SALES'].describe()

dietsmash_data= market_demand[market_demand['BRAND']=='DIET SMASH']['UNIT_SALES'].describe()

# Concatenate the DataFrames horizontally
combined_data = pd.concat([dietsmash_data, vb_data], axis=1)
combined_data.columns = ['DIET SMASH', 'VENOMOUS BLAST']  # Rename columns for clarity

combined_data
Out[142]:
DIET SMASH VENOMOUS BLAST
count 17483.000000 51756.000000
mean 27.975093 69.975983
std 30.263642 246.676148
min 1.000000 1.000000
25% 9.000000 6.000000
50% 21.000000 15.000000
75% 38.000000 40.250000
max 698.000000 3701.000000

There is huge difference in standard deviation, mean and max values. There seems to outliers in the data pushing the mean higher for VB brand.

Presence of potential Outliers in VENOMOUS BLAST Brand:

Plots for both the brands clearly shows the presence of several high purchase orders or potential outliers for VB Brand. However, we see a very similar value for 25%, 50% and even 75% quartile values from the table above. These outliers have pushed the average unit sales higher for VB.

Let's dive into these outlying points:

In [143]:
import matplotlib.pyplot as plt

# Filter the data for 'VENOMOUS BLAST' and 'DIET SMASH' brands
venomous_blast_data = market_demand[market_demand['BRAND'] == 'VENOMOUS BLAST']
diet_smash_data = market_demand[market_demand['BRAND'] == 'DIET SMASH']

# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

# Plot for 'VENOMOUS BLAST' brand
axes[0].hist(venomous_blast_data['UNIT_SALES'], bins=30, color='skyblue', edgecolor='black')
axes[0].set_title('Unit Sales Distribution for VENOMOUS BLAST Brand')
axes[0].set_xlabel('Unit Sales')
axes[0].set_ylabel('Frequency')
axes[0].grid(True)

# Plot for 'DIET SMASH' brand
axes[1].hist(diet_smash_data['UNIT_SALES'], bins=30, color='lightgreen', edgecolor='black')
axes[1].set_title('Unit Sales Distribution for DIET SMASH Brand')
axes[1].set_xlabel('Unit Sales')
axes[1].set_ylabel('Frequency')
axes[1].grid(True)

# Adjust layout
plt.tight_layout()
plt.show()

Plot confirms presence of outliers. Diving deep:

In [144]:
import numpy as np
venomous_blast_data = market_demand[market_demand['BRAND'] == 'VENOMOUS BLAST']

#  percentile values
percentiles = np.arange(75, 96, 5)  # Percentiles from 75 to 95 in steps of 5
percentiles = np.append(percentiles, np.arange(96, 101, 1))  # Percentiles from 96 to 100 in steps of 1
percentile_values = np.percentile(venomous_blast_data['UNIT_SALES'], percentiles)

percentile_table = pd.DataFrame({'Percentile': percentiles, 'Value': percentile_values})
percentile_table
Out[144]:
Percentile Value
0 75 40.25
1 80 52.00
2 85 72.00
3 90 115.00
4 95 231.00
5 96 299.00
6 97 402.00
7 98 594.90
8 99 1612.80
9 100 3701.00

Table for VB above proves to us that even until 98 percentile, the values closely follow and match with the maximum value for DIET SMASH which is about 700. However, the 99th and 100th percentile values of 1600 and 3700 are ridiculously high in the scope of this dataset.

In this context, these outliers account for VB's higher mean value and hence we will proceed to consider average sales values to be more or less same for both the brands.

Summary

  • We filtered the dataset by extracting data corresponding to Plum + Swire + Diet and developed Prophet model after validating model performance over unseen data (test set).
  • Model MAPE error is .. over the test set.

  • From the forecast obtained from the model, we observe best 13 weeks are week 0 to week 12 where week 0 starts from the 1st week of December as we have taken minimum data in the dataset as the 1st week.

  • The obtained forecast for the best 13 weeks are about 9060 in total unit sales.

  • Using Historic Ratio method, we analyzed and comapred the remaining product attributes and found not much of a difference. Therefore the forecasted sales figure of 9060 will remain the final result for this innovative product.

Q2 ¶

Item Description: Sparkling Jacceptabletlester Avocado 11Small MLT

  • Caloric Segment: Regular
  • Market Category: SSD
  • Manufacturer: Swire-CC
  • Brand: Sparkling Jacceptabletlester
  • Package Type: 11Small MLT
  • Flavor: ‘Avocado’

Swire plans to release this product 2 weeks prior to Easter and 2 weeks post Easter. What will the forecasted demand be, in weeks, for this product?

Data Preparation¶

In [3]:
avocado = market_demand[(market_demand['ITEM'].str.contains('AVOCADO', case=False, regex=True))]
avocado_regular = avocado[(avocado['CALORIC_SEGMENT'].str.contains('REGULAR', case=False, regex=True, na=False))]
avocado_regular_ssd = avocado_regular[(avocado_regular['CATEGORY'].str.contains('SSD', case=False, regex=True))]
avocado_regular_ssd_swire = avocado_regular_ssd[(avocado_regular_ssd['MANUFACTURER'].str.contains('SWIRE', case=False, regex=True))]
avocado_regular_ssd_swire_small = avocado_regular_ssd_swire[(avocado_regular_ssd_swire['PACKAGE'].str.contains('SMALL', case=False, regex=True))]

We're able to find avocado flavor in the data. Along with it is filtered with it's caloric segment, market category, and manufacturer. We have filtered the package type to small which is only closest we have given that its a new packaging type.

In [4]:
df = avocado_regular_ssd_swire_small[['DATE','UNIT_SALES']].groupby(by="DATE", as_index=False).sum()
df['DATE'] = pd.to_datetime(df['DATE'])
df.set_index('DATE', inplace=True)
df = df.asfreq('W-SAT')
In [5]:
plt.plot(df)
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
Out[5]:
Text(0, 0.5, 'UNIT_SALES')

There are NAN in some weeks. as we dig deeper we see that there are no sale on the month of august 2023.

Before building model, lets validate by creating train and test sets:

In [6]:
# Train and Test periods
from datetime import datetime

s_date = '2023-02-25'
e_date = '2023-08-26'
df = df.loc[df.index <= e_date]
train_period = df.loc[df.index < s_date]
test_period = df.loc[(df.index >= s_date) & (df.index <= e_date)]


start_forecast = datetime.strptime(s_date, '%Y-%m-%d').date()
end_forecast = datetime.strptime(e_date, '%Y-%m-%d').date()
In [7]:
easter_start_2023 = '2023-04-01'
easter_end_2023 = '2023-04-22'

easter_weeks_test = df.loc[(df.index >= easter_start_2023) & (df.index <= easter_end_2023)]

Model Performance Metrics Function¶

We will create and resuse this function to measure each models performance with the data. This will include MAE, MSE, RMSE, MAPE and direct accuracy percentage.

In [2]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt
import numpy as np

def get_scores(actual, predicted):### Modelling Process
    # Calculate errors
    mae = mean_absolute_error(actual, predicted)
    mse = mean_squared_error(actual, predicted)
    rmse = sqrt(mse)
    percentage_diff = np.abs((actual - predicted) / actual) * 100

    # Calculate MAPE
    mape = percentage_diff.mean()

    # Calculate "Accuracy" Percentage
    accuracy_percentage = 100 - mape

    # Print metrics
    print(f"MAE: {mae}")
    print(f"MSE: {mse}")
    print(f"RMSE: {rmse}")
    print(f"MAPE: {mape}%")
    print(f"Direct 'Accuracy' Percentage: {accuracy_percentage}%")

    return pd.Series(data={'mae':mae, 'mse':mse, 'rmse':rmse, 'mape':mape, 'direct_accuracy':accuracy_percentage}, index=['mae', 'mse', 'rmse','mape','direct_accuracy'])

Modeling Process

Arima¶

ARIMA is a statistical model used for time series analysis to forecast future data points by leveraging past data. It combines three main aspects: autoregression (AR), differencing (I) to make the time series stationary, and moving average (MA). The AR part exploits the relationship between an observation and a number of lagged observations, the I part involves differencing the data to achieve stationarity, and the MA part models the error of the observation as a combination of previous error terms.

In [ ]:
from statsmodels.tsa.arima.model import ARIMA

# ARIMA Model fitting
order = (1, 1, 1)  # Still (p, d, q)

model = ARIMA(train_period['UNIT_SALES'], order=order)
model_fit = model.fit()

# Forecasting
forecast = model_fit.predict(start=start_forecast,
                             end=end_forecast)

arima_forecast = forecast
In [10]:
arima_score = get_scores(test_period.squeeze(), arima_forecast)
MAE: 13878.577969407064
MSE: 333660932.83898854
RMSE: 18266.3880622029
MAPE: 13.029688220376284%
Direct 'Accuracy' Percentage: 86.97031177962371%

Metrics for Easter weeks:

In [11]:
arima_easter_score = get_scores(easter_weeks_test.squeeze(), arima_forecast.loc[(arima_forecast.index >= easter_start_2023) & (arima_forecast.index <= easter_end_2023)])
MAE: 16694.848477498203
MSE: 332614641.0282627
RMSE: 18237.72576359955
MAPE: 15.726140646868652%
Direct 'Accuracy' Percentage: 84.27385935313134%
In [12]:
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(easter_weeks_test['UNIT_SALES'], label='Actual Sales')
plt.plot(arima_forecast.loc[(arima_forecast.index >= easter_start_2023) & (arima_forecast.index <= easter_end_2023)], label='Forecast', color='red')
plt.xticks(easter_weeks_test.index)
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()

Plot above shows the actual vs forecast of the sales for the Easter weeks. Since, the time period is small, looks like ARIMA is unable to capture details.

SARIMA¶

SARIMA extends ARIMA by explicitly accommodating and modeling seasonal effects in time series data. It includes additional seasonal elements on top of the AR, I, and MA components. SARIMA is characterized by its ability to model both non-seasonal and seasonal components of the time series data, making it more versatile than ARIMA for data with clear seasonal patterns, such as sales data around specific holidays or events. It incorporates additional parameters to handle seasonality, which are seasonal AR, seasonal differencing, and seasonal MA components, allowing it to capture seasonal fluctuations effectively, making it ideal for products with seasonal demand.

In [ ]:
from statsmodels.tsa.statespace.sarimax import SARIMAX


# SARIMA Model fitting
order = (1, 1, 1)  # (p, d, q)
seasonal_order = (1, 1, 1, 52)  # (P, D, Q, s)

model = SARIMAX(train_period['UNIT_SALES'], order=order, seasonal_order=seasonal_order)
model_fit = model.fit()

# Forecasting
forecast = model_fit.predict(start=start_forecast.strftime('%Y-%m-%d'),
                             end=end_forecast.strftime('%Y-%m-%d'))
sarima_forecast = forecast
In [14]:
sarima_score = get_scores(test_period.squeeze(), sarima_forecast)
MAE: 10543.398089743641
MSE: 170655919.02185866
RMSE: 13063.533940777996
MAPE: 9.790936856842896%
Direct 'Accuracy' Percentage: 90.20906314315711%

Metrics for Easter Weeks:

In [15]:
sarima_easter_score = get_scores(easter_weeks_test.squeeze(), sarima_forecast.loc[(sarima_forecast.index >= easter_start_2023) & (sarima_forecast.index <= easter_end_2023)])
MAE: 19749.376548111453
MSE: 497937909.1387423
RMSE: 22314.522382043993
MAPE: 19.265721342075864%
Direct 'Accuracy' Percentage: 80.73427865792414%
In [16]:
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(easter_weeks_test['UNIT_SALES'], label='Actual Sales')
plt.plot(sarima_forecast.loc[(sarima_forecast.index >= easter_start_2023) & (sarima_forecast.index <= easter_end_2023)], label='Forecast', color='red')
plt.xticks(easter_weeks_test.index)
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()

SARIMA surely appears to react better to the details of the actual unit sales changes over ARIMA.

Prophet¶

Prophet is a forecasting tool designed by Facebook for handling time series data that displays patterns on different time scales such as yearly, weekly, and daily. It is especially useful for data with strong seasonal effects and several seasons of historical data. Prophet works by fitting nonlinear trends with yearly, weekly, and daily seasonality, plus holiday effects. It is robust to missing data and shifts in the trend, and typically requires no manual tuning of parameters. The model accommodates seasonality through Fourier series and includes components for holidays and special events, making it well-suited for predicting demand for products around specific events or holidays, like Easter.

In [ ]:
from prophet import Prophet

df_mod = train_period.reset_index()
df_mod['DATE'] = pd.to_datetime(df_mod['DATE'])
df_mod.columns = ['ds', 'y']

model = Prophet(changepoint_prior_scale=0.1,seasonality_prior_scale=0.5, weekly_seasonality=True)
model.fit(df_mod)

# Forecast the next 52 weeks (1 year)
future = model.make_future_dataframe(periods=53, freq='W-SAT')
forecast = model.predict(future)

prophet_forecast = forecast[['ds','yhat']]
prophet_forecast['ds'] = pd.to_datetime(prophet_forecast['ds'])
prophet_forecast.columns = ['DATE', 'yhat']
prophet_forecast.set_index('DATE', inplace=True)
prophet_forecast = prophet_forecast.asfreq('W-SAT')
prophet_forecast = prophet_forecast.loc[(prophet_forecast.index >= s_date) & (prophet_forecast.index <= e_date)]
In [18]:
prophet_score = get_scores(test_period.squeeze(), prophet_forecast.loc[(prophet_forecast.index >= s_date) & (prophet_forecast.index <= e_date)].squeeze())
MAE: 10354.142149221083
MSE: 178938113.5920231
RMSE: 13376.775156666987
MAPE: 9.585562452804588%
Direct 'Accuracy' Percentage: 90.4144375471954%

Easter week metrics

In [19]:
prophet_easter_score = get_scores(easter_weeks_test.squeeze(), prophet_forecast.loc[(prophet_forecast.index >= easter_start_2023) & (prophet_forecast.index <= easter_end_2023)].squeeze())
MAE: 13672.249500615759
MSE: 250724046.9852324
RMSE: 15834.268122816173
MAPE: 13.398864372818881%
Direct 'Accuracy' Percentage: 86.60113562718112%
In [ ]:
plt.figure(figsize=(10, 6))
sns.lineplot(data=easter_weeks_test, x=easter_weeks_test.index, y='UNIT_SALES', label='Actual Sales')
sns.lineplot(data=prophet_forecast.loc[(prophet_forecast.index >= easter_start_2023) & (prophet_forecast.index <= easter_end_2023)], x=prophet_forecast.loc[(prophet_forecast.index >= easter_start_2023) & (prophet_forecast.index <= easter_end_2023)].index, y='yhat', label='Forecast', color='red')
plt.xticks(easter_weeks_test.index)
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()

image.png

Exponential Smoothing¶

Exponential Smoothing is a time series forecasting method for univariate data that applies smoothing factors to the observations, giving more weight to recent observations while not discarding older observations entirely. It encompasses simple exponential smoothing for data with no clear trend or seasonality, and extends to Holt’s linear trend model and Holt-Winters’ seasonal model, which can account for both trends and seasonality in the data. This method is straightforward and computationally efficient, making it a good choice for producing quick forecasts in situations where data patterns are reasonably consistent over time, but may struggle with data that has complex patterns or significant irregularities.

In [ ]:
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# ExponentialSmoothing Model fitting
model = ExponentialSmoothing(train_period['UNIT_SALES'], trend='add', seasonal='add', seasonal_periods=52).fit(smoothing_level=0.6, smoothing_slope=0.2, smoothing_seasonal=0.2)

# Forecasting

forecast_periods = ((end_forecast - start_forecast).days // 7) + 1

exponential_forecast = model.forecast(forecast_periods)

forecast_dates = pd.date_range(start=start_forecast, periods=forecast_periods, freq='W-SAT')
In [22]:
exponential_smoothing_score = get_scores(test_period.squeeze(), exponential_forecast)
MAE: 11515.003782552438
MSE: 176593000.4153336
RMSE: 13288.829911445688
MAPE: 11.473932683680282%
Direct 'Accuracy' Percentage: 88.52606731631971%

Easter week metrics:

In [23]:
exponential_smoothing_easter_score = get_scores(easter_weeks_test.squeeze(), exponential_forecast.loc[(exponential_forecast.index >= easter_start_2023) & (exponential_forecast.index <= easter_end_2023)])
MAE: 15194.8459825914
MSE: 339963988.23290277
RMSE: 18438.112382586856
MAPE: 15.957019954688251%
Direct 'Accuracy' Percentage: 84.04298004531175%
In [24]:
# Plotting the results

easter_start_2023 = pd.to_datetime(easter_start_2023)
easter_end_2023 = pd.to_datetime(easter_end_2023)

# Filter the dates
filtered_dates = forecast_dates[(forecast_dates >= easter_start_2023) & (forecast_dates <= easter_end_2023)]


plt.figure(figsize=(10, 6))
plt.plot(easter_weeks_test.index, easter_weeks_test['UNIT_SALES'], label='Actual Sales')
plt.plot(filtered_dates, exponential_forecast.loc[(exponential_forecast.index >= easter_start_2023) & (exponential_forecast.index <= easter_end_2023)], label='Forecast', color='red')
plt.xticks(easter_weeks_test.index)
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()

Performance Summary¶

In [25]:
pd.options.display.float_format = '{:.2f}'.format
q2_scores = pd.DataFrame({'Arima':arima_score, 'Sarima':sarima_score, 'Prophet':prophet_score, 'Exponential Smoothing':exponential_smoothing_score}).T
print(q2_scores)
                           mae          mse     rmse  mape  direct_accuracy
Arima                 13878.58 333660932.84 18266.39 13.03            86.97
Sarima                10543.40 170655919.02 13063.53  9.79            90.21
Prophet               10354.14 178938113.59 13376.78  9.59            90.41
Exponential Smoothing 11515.00 176593000.42 13288.83 11.47            88.53
In [26]:
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(18, 10))
fig.suptitle('Model Performance Metrics')

for i, column in enumerate(q2_scores.columns):
    row, col = divmod(i, 3)
    sns.barplot(ax=axes[row, col], x=q2_scores.index, y=q2_scores[column])
    axes[row, col].set_title(column)
    axes[row, col].set_ylabel('')

# Adjust layout
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

Based on the detailed metrics provided, the Prophet model is identified as the most fitting choice for this context. Notably, while its MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error) values are slightly higher than those of the SARIMA model, the differences are marginal. This slight increase in error metrics is counterbalanced by Prophet's MAPE (Mean Absolute Percentage Error) of 9.59, which is the lowest among the models evaluated, indicating its predictions are proportionately closer to actual values, making it highly effective in scenarios where percentage errors are more impactful than absolute errors. Moreover, Prophet's direct accuracy rate of 90.41% is the highest, signifying its unmatched precision in predicting future values directly. This level of accuracy is particularly valuable in applications where correct directional forecasting is crucial. Furthermore, Prophet's flexibility in handling seasonality and missing data, along with its ability to incorporate external variables, offers a robust modeling choice that adapts well to complex datasets. Thus, considering the balance between nuanced accuracy, error margin, and the ability to handle real-world data complexities, Prophet emerges as the superior model for forecasting tasks where adaptability and precision are paramount.

Let's take a look at how good is in predicting easter 2023.

In [27]:
q2_easter_scores = pd.DataFrame({'Arima':arima_easter_score, 'Sarima':sarima_easter_score, 'Prophet':prophet_easter_score, 'Exponential Smoothing':exponential_smoothing_easter_score}).T
print(q2_easter_scores)
                           mae          mse     rmse  mape  direct_accuracy
Arima                 16694.85 332614641.03 18237.73 15.73            84.27
Sarima                19749.38 497937909.14 22314.52 19.27            80.73
Prophet               13672.25 250724046.99 15834.27 13.40            86.60
Exponential Smoothing 15194.85 339963988.23 18438.11 15.96            84.04
In [28]:
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(18, 10))
fig.suptitle('Model Performance Metrics')

for i, column in enumerate(q2_easter_scores.columns):
    row, col = divmod(i, 3)
    sns.barplot(ax=axes[row, col], x=q2_scores.index, y=q2_scores[column])
    axes[row, col].set_title(column)
    axes[row, col].set_ylabel('')

# Adjust layout
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

Given the updated metrics, the Prophet model distinctly emerges as the premier choice for forecasting Easter 2023. This preference is underscored by its superior Mean Absolute Error (MAE) of 13672.25, which is the lowest among the models, indicating that its average deviation from actual values is minimal. This is particularly significant in forecasting scenarios where precision in predictions is paramount. The Prophet model also boasts a commendable Root Mean Squared Error (RMSE) of 15834.27, which, despite not being the absolute lowest, reflects its robustness in handling both small and large prediction errors effectively. Furthermore, its Mean Absolute Percentage Error (MAPE) stands at 13.40, underscoring its efficiency in maintaining proportionate accuracy across various magnitudes of data—a critical aspect for ensuring reliability in percentage terms.

The model's Direct Accuracy rate of 86.60% is another key factor in its favor, showcasing its exceptional ability to predict the direction of trends accurately. This attribute is essential in applications where understanding the directional movement of data points is more critical than the exact values.

Prophet's standout performance is not just in its error metrics but also in its inherent features that cater well to forecasting tasks like Ea2023. It adeptly handles seasonality and trends over time, making it exceptionally suitable for predicting events with complex patterns. Additionally, Prophet's flexibility in integrating holidays and special events into its forecasts further cements its suitability for accurately predicting specific occasions like Easter, which moves annually.

Considering these attributes, the Prophet model offers a harmonious blend of accuracy, reliability, and adaptability, making it the most fitting choice for forecasting events with significant annual variations like r 2023. Its ability to deliver precise forecasts despite the complexities of seasonal patterns and holiday impacts renders it a robust tool for such predictive tasks.

Predicting Easter 2024 using Prophet¶

In [29]:
df = avocado_regular_ssd_swire_small[['DATE','UNIT_SALES']].groupby(by="DATE", as_index=False).sum()
df['DATE'] = pd.to_datetime(df['DATE'])
df.set_index('DATE', inplace=True)
df = df.loc[df.index <= "2023-08-26"]
df = df.asfreq('W-SAT')
In [30]:
start_forecast = datetime.strptime("2024-03-23", '%Y-%m-%d').date()
end_forecast = datetime.strptime("2024-04-13", '%Y-%m-%d').date()
In [ ]:
from prophet import Prophet

df_mod = df.reset_index()
df_mod['DATE'] = pd.to_datetime(df_mod['DATE'])
df_mod.columns = ['ds', 'y']

model = Prophet(changepoint_prior_scale=0.1,seasonality_prior_scale=0.5, weekly_seasonality=True)
model.fit(df_mod)

# Forecast the next 52 weeks (1 year)
future = model.make_future_dataframe(periods=53, freq='W-SAT')
forecast = model.predict(future)

prophet_forecast = forecast[['ds','yhat','yhat_upper','yhat_lower']]
prophet_forecast['ds'] = pd.to_datetime(prophet_forecast['ds'])
prophet_forecast.columns = ['DATE', 'yhat','yhat_upper','yhat_lower']
prophet_forecast.set_index('DATE', inplace=True)
prophet_forecast = prophet_forecast.asfreq('W-SAT')
prophet_forecast = prophet_forecast.loc[(prophet_forecast.index >= "2024-03-23") & (prophet_forecast.index <= "2024-04-13")]
In [32]:
prophet_forecast
Out[32]:
yhat yhat_upper yhat_lower
DATE
2024-03-23 93763.73 110608.42 75509.07
2024-03-30 98443.33 116404.55 81232.11
2024-04-06 108617.99 126353.05 92715.61
2024-04-13 115180.50 130721.01 97186.91

values in yhat column above are the forecasted unit sales during the Easter weeks.

In [34]:
plt.figure(figsize=(10, 6))
plt.plot(prophet_forecast.index, prophet_forecast['yhat'], label='Predicted', color='blue', marker='o')
plt.fill_between(prophet_forecast.index, prophet_forecast['yhat_lower'], prophet_forecast['yhat_upper'], color='gray', alpha=0.2, label='Confidence Interval')
plt.xticks(prophet_forecast.index)
plt.title('Forecast with Confidence Interval')
plt.xlabel('Date')
plt.ylabel('Forecasted UNIT_SALES')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

Summary¶

Given the new data and insights, our exploration into forecasting the demand for Swire's innovative product, the Sparkling Jacceptabletlester Avocado 11Small MLT around Easter 2024, was meticulously conducted using advanced time series analysis. The process commenced with a thorough data preparation phase, where the market demand dataset was strategically filtered to hone in on relevant parameters: avocado flavor, regular caloric segment, SSD market category, Swire-CC as the manufacturer, and a focus on small package types. This precise data curation was pivotal in aligning the analysis with the specific product profile of interest, ensuring a targeted approach to forecasting.

Upon evaluation, the forecasting models—ARIMA, SARIMA, Prophet, and Exponential Smoothing—underwent rigorous performance assessment through a suite of metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and Direct 'Accuracy' Percentage. These metrics served as critical indicators of each model's predictive accuracy and reliability, essential for identifying the most suitable forecasting tool for the task at hand.

The performance summary revealed a standout model, differing from the initial assessment, underscoring the dynamic nature of forecasting methodologiTs, the Prophet model demonstrated exceptional prowess, outperforming others in key metrics. Its predictive capabilities showcased not just through lower MAE and RMSE values but also in its superior MAPE and Direct Accuracy percentages. These strengths indicate the Prophet model's predictions are not only close to actual sales data but also consistently reliable across different forecasting scenarios.

Interestingly, while the Exponential Smoothing model showed promise in the initial analysis, a reevaluation based on the latest insights pointed towards the Prophet model's unmatched efficiency and adaptability, particularly in handling seasonal variations around Easter 2024. The Prophet model's ability to incorporate seasonal trends, holidays, and other external variables makes it exceptionally suitable for forecasting demand for products like the Sparkling Jacceptabletlester Avocado 11Small MLT, which likely sees fluctuating demand due to seasonal events such as Easter.

The forecast from the Prophet model suggests a nuanced understanding of demand dynamics around Easter 2024, projecting an upswing in sales that peaks in the week leading up to Easter, before moderating in the following weeks. This pattern offers Swire invaluable strategic insights, enabling the company to optimize production, distribution, and marketing efforts to fully capitalize on the anticipated demand surge.

In conclusion, the Prophet model, with its robust handling of complex seasonal patterns and its adeptness in forecasting under conditions of uncertainty, stands out as the superior choice for predicting the demand for Swire's new avocado-flavored beverage around Easter 2024. The insights derived from this model provide a solid foundation for informed decision-making, ensuring that Swire can navigate the seasonal market dynamics effectively and maximize the product's success in the competitive beverage landscape.

Q3

Item Description: Diet Venomous Blast Energy Drink Kiwano 16 Liquid Small

  • a. Caloric Segment: Diet
  • b. Market Category: Energy
  • c. Manufacturer: Swire-CC
  • d. Brand: Venomous Blast
  • e. Package Type: 16 Liquid Small
  • f. Flavor: ’Kiwano’

Which 13 weeks of the year would this product perform best in the market? What is the forecasted demand, in weeks, for those 13 weeks?

Data Preparation¶

Kiwano + Diet + Swire¶

In [5]:
# filtering for kiwano
kiwano = market_demand[(market_demand['ITEM'].str.contains('KIWANO', case=False, regex=True))]

# Filtering Swire + kiwano data
kiwano_swire = kiwano[kiwano['MANUFACTURER'] == 'SWIRE-CC']

# Filtering Swire +kiwano + Diet
kiwano_swire_diet = kiwano_swire[kiwano_swire['CALORIC_SEGMENT']  == 'DIET/LIGHT']

Grouping by date to aggregate daily sales data:

In [6]:
# Assuming 'DATE' column is in datetime format
kiwano_swire_diet['DATE'] = pd.to_datetime(kiwano_swire_diet['DATE'])
kiwano_swire_diet_sorted= kiwano_swire_diet.sort_values('DATE')

# Aggregate by Date and compute unit sales
kiwano_swire_diet_sorted_daily = kiwano_swire_diet.groupby('DATE')['UNIT_SALES'].sum().reset_index()

Adding week number column:

In [8]:
earliest_date = kiwano_swire_diet_sorted_daily['DATE'].min()

# Step 2: Calculate week numbers
kiwano_swire_diet_sorted_daily['WEEK_NUMBER'] = ((kiwano_swire_diet_sorted_daily['DATE'] - earliest_date).dt.days // 7) % 52

# Step 3: Aggregate the data for every week
weekly_data_kiwano = kiwano_swire_diet_sorted_daily.groupby('WEEK_NUMBER')['UNIT_SALES'].sum().reset_index()

Aggregating data by a 13 week interval and plotting the rolling 13 week sum sales:

In [9]:
import pandas as pd

# Assuming weekly_data is your DataFrame containing 'WEEK_NUMBER' and 'UNIT_SALES' columns

# Sort the DataFrame by 'WEEK_NUMBER' if it's not already sorted
weekly_data_kiwano = weekly_data_kiwano.sort_values(by='WEEK_NUMBER')

# Ensure that there are enough rows to compute the rolling sum
if len(weekly_data_kiwano) < 13:
    print("Insufficient data to compute rolling sum.")
else:
    # Initialize an empty list to store the results
    rolling_sales_sum = []

    # Iterate over the DataFrame to calculate the rolling sum for each 13-week interval
    for i in range(len(weekly_data_kiwano) - 12):
        # Calculate the sum of UNIT_SALES for the current 13-week interval
        sales_sum = weekly_data_kiwano.loc[i:i+12, 'UNIT_SALES'].sum()
        # Construct the 13-week interval string
        interval = f"{weekly_data_kiwano.iloc[i]['WEEK_NUMBER']}-{weekly_data_kiwano.iloc[i+12]['WEEK_NUMBER']}"
        # Append the result to the list
        rolling_sales_sum.append((sales_sum, interval))

    # Create a new DataFrame from the results
    rolling_sales_df = pd.DataFrame(rolling_sales_sum, columns=['SALES_13W_SUM', '13_WEEK_INTERVAL'])
In [10]:
import matplotlib.pyplot as plt

# Assuming rolling_sales_df contains the DataFrame with 'SALES_13W_SUM' and '13_WEEK_INTERVAL' columns

# Extract data for plotting
sales_sum = rolling_sales_df['SALES_13W_SUM']
interval = rolling_sales_df['13_WEEK_INTERVAL']

# Plot the rolling sum
plt.figure(figsize=(18, 6))
plt.plot(interval, sales_sum, marker='o', linestyle='-')
plt.xlabel('13-Week Interval')
plt.ylabel('Sales Sum')
plt.title('Rolling Sum of Sales over 13-Week Intervals')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.grid(True)

# Find the index of the maximum value
max_index = sales_sum.idxmax()
max_value = sales_sum[max_index]
max_interval = interval[max_index]

# Annotate the peak with a red marker and the corresponding 13-week interval
plt.annotate(f'Max: {max_value} ({max_interval})',
             xy=(max_interval, max_value),
             xytext=(70, -20),
             textcoords='offset points',
             arrowprops=dict(facecolor='red', arrowstyle='->'),
             fontsize=10,
             color='red',
             ha='center')

plt.tight_layout()
plt.show()

From the plot, we could clearly see that the maximum sales are observed from the week 19- Week 31.

Prophet Model¶

  • Prophet modelling is considered as the best fit for forecasting highest sales in 13-Week Interval because Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data.
  • Prophet model not only helps in validating our historical data but also forecast and predicts the future sales using the historical data.

Train and Test Split:¶

The data was split into training and testing sets with 80 and 20 % ratio

  • Training Set (train_data): This set will be used for training our prophet model.
  • Testing Set (test_data): This set will be reserved for evaluating the performance of our model. It helps us understand how well the model is performing as we have back-up results to compare the predicted results.
In [11]:
from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly
import pandas as pd



# Calculate the index for partitioning the data
partition_index = int(len(kiwano_swire_diet_sorted_daily) * 0.8)

# Split the data into train and test sets
train_data = kiwano_swire_diet_sorted_daily.iloc[:partition_index]
test_data = kiwano_swire_diet_sorted_daily.iloc[partition_index:]

# Rename columns as per Prophet's requirements
train_data = train_data.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
test_data = test_data.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})

# Print the lengths of train and test dataframes
print("Length of Train Data:", len(train_data))
print("Length of Test Data:", len(test_data))

# Print the start and end dates of train and test dataframes
print("Start Date of Train Data:", train_data['ds'].min())
print("End Date of Train Data:", train_data['ds'].max())
print("Start Date of Test Data:", test_data['ds'].min())
print("End Date of Test Data:", test_data['ds'].max())
Length of Train Data: 118
Length of Test Data: 30
Start Date of Train Data: 2020-12-05 00:00:00
End Date of Train Data: 2023-03-04 00:00:00
Start Date of Test Data: 2023-03-11 00:00:00
End Date of Test Data: 2023-10-28 00:00:00
In [15]:
# Python
m = Prophet()
m.fit(train_data)
pred_df = m.predict(test_data)
pred_df.head()
12:27:59 - cmdstanpy - INFO - Chain [1] start processing
12:28:00 - cmdstanpy - INFO - Chain [1] done processing
Out[15]:
ds trend yhat_lower yhat_upper trend_lower trend_upper additive_terms additive_terms_lower additive_terms_upper yearly yearly_lower yearly_upper multiplicative_terms multiplicative_terms_lower multiplicative_terms_upper yhat
0 2023-03-11 19485.394281 15920.064503 19464.224109 19483.843303 19488.275912 -1759.759261 -1759.759261 -1759.759261 -1759.759261 -1759.759261 -1759.759261 0.0 0.0 0.0 17725.635020
1 2023-03-18 19247.662322 16487.443642 19995.816392 19232.930247 19265.132350 -1028.066953 -1028.066953 -1028.066953 -1028.066953 -1028.066953 -1028.066953 0.0 0.0 0.0 18219.595369
2 2023-03-25 19009.930363 17453.491002 20805.587420 18976.524053 19051.408845 49.613071 49.613071 49.613071 49.613071 49.613071 49.613071 0.0 0.0 0.0 19059.543434
3 2023-04-01 18772.198405 18148.134072 21589.310780 18712.260886 18844.356146 1040.777482 1040.777482 1040.777482 1040.777482 1040.777482 1040.777482 0.0 0.0 0.0 19812.975887
4 2023-04-08 18534.466446 18501.014865 21900.768394 18440.762472 18635.148118 1679.434987 1679.434987 1679.434987 1679.434987 1679.434987 1679.434987 0.0 0.0 0.0 20213.901433

Above are predicted results on the test data.

In [13]:
fig1 = m.plot(pred_df)

The dotted area shows the actual values and the blue region shows the future prediction

Validation¶

To evaluate the performance of a Prophet model for time series forecasting, we have used the following metrics to gauge its accuracy and effectiveness.

Mean Absolute Percentage Error (MAPE): MAPE measures the percentage difference between actual and predicted values, providing insight into the overall accuracy of the forecasts. It gives us a better understanding of the magnitude of errors relative to the actual values.

Mean Absolute Error (MAE): MAE measures the average magnitude of errors between predicted and actual values, regardless of their direction. This method is robust to outliers and can help in the better interpetation of the results.

Mean Squared Error (MSE): MSE measures the average squared difference between predicted and actual values, emphasizing larger errors due to the squaring effect. MSE is widely used for assessing the goodness of fit of a model, penalizing larger errors more heavily.

Root Mean Squared Error (RMSE): RMSE is the square root of the MSE and shares its characteristics while providing results in the same units as the original data. It is widely used metric as the results are easy and intuitive interpretation as we can relate with the original data.

Accuracy: Accuracy would provide a interpretation on how accurate our model is performing.

All these metrics collectively offer a comprehensive assessment of the Prophet model's performance in forecasting time series data and for better decision making.

In [16]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt
import numpy as np
In [17]:
def dividefn(num1, num2):
  try:
    return num1/num2
  except ZeroDivisionError:
    return 0
In [18]:
def mean_absolute_percentage_error(y_true, y_pred):
  y_true, y_pred = np.array(y_true), np.array(y_pred)
  out = [0] * len(y_true)
  for i in np.arange(len(y_true)):
    a = (y_pred[i] - y_true[i])*1
    b = y_true[i]
    out[i] = dividefn(a,b)
  return np.mean(np.abs(out))*100
In [19]:
def val_metrics_sales(inp_df):
  metric= []
  mape = np.round(mean_absolute_percentage_error(test_data['y'],pred_df['yhat']),3)
  mae = np.round(mean_absolute_error(test_data['y'], pred_df['yhat']),3)
  mse_error = np.round(mean_squared_error(test_data['y'],pred_df['yhat']),3)
  rmse = np.round(np.sqrt(mean_squared_error(test_data['y'],pred_df['yhat'])),3)
  accuracy = np.round((100 - abs(mape)),3)
  metric.append([mape,mae,mse_error,rmse,accuracy])
  metric = np.array(metric)
  return metric
In [20]:
Validation_metric = val_metrics_sales(pred_df)
In [21]:
print("MAPE:" , Validation_metric[0][0])
print("MAE:" , Validation_metric[0][1])
print("MSE_Error:" , Validation_metric[0][2])
print("RMSE:" , Validation_metric[0][3])
print("Accuracy:" , Validation_metric[0][4])
MAPE: 20.181
MAE: 3731.182
MSE_Error: 19613458.552
RMSE: 4428.708
Accuracy: 79.819

Our model gave 80 % accuracy with Mape error of 20%.

Demand Forecasting for upcoming one year:¶

In [22]:
kiwano_swire_diet_sorted_daily.tail(1)
Out[22]:
DATE UNIT_SALES WEEK_NUMBER
147 2023-10-28 15549.67 47

The last date is 2023-10-28

In [24]:
kiwano_future = kiwano_swire_diet_sorted_daily.copy()

#  Renaming columns for prophet requirements
kiwano_future.rename(columns = {'DATE':'ds','UNIT_SALES':'y'}, inplace = True)
In [25]:
m_2 = Prophet()
m_2.fit(kiwano_future)

# Create future dates for forecasting
future_kiwano = m_2.make_future_dataframe(periods=52,freq = 'W')

# Forecasting for 1 year
future_kiwano_forecast = m_2.predict(future_kiwano)

future_kiwano_forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
12:30:42 - cmdstanpy - INFO - Chain [1] start processing
12:30:42 - cmdstanpy - INFO - Chain [1] done processing
Out[25]:
ds yhat yhat_lower yhat_upper
195 2024-09-22 9112.684434 5202.013051 13014.200778
196 2024-09-29 8509.657160 4758.182902 12584.130980
197 2024-10-06 7670.426502 3965.991301 11509.779202
198 2024-10-13 6735.039638 2736.616109 10611.176965
199 2024-10-20 6006.095057 1857.985922 10155.324535

Above are the prediction results

In [ ]:
plot_plotly(m_2, future_kiwano_forecast)

image.png

In [27]:
# Filtering out prediction to get forecasted sales
forecast_future = future_kiwano_forecast[future_kiwano_forecast['ds'] > '2023-10-28']
forecast_future
# # No Negative Sales
forecast_future['yhat'] = forecast_future['yhat'].clip(lower=0)
In [28]:
forecast_future_date_sales = forecast_future[['ds','yhat']]
In [32]:
forecast_future_date_sales.head(5)
Out[32]:
ds yhat
148 2023-10-29 13004.568376
149 2023-11-05 12468.837648
150 2023-11-12 11637.278663
151 2023-11-19 10663.885980
152 2023-11-26 9845.748470

Extracted data for 1 year forecast

In [33]:
earliest_date = forecast_future_date_sales['ds'].min()

# Step 2: Calculate week numbers
forecast_future_date_sales['WEEK_NUMBER'] = ((forecast_future_date_sales['ds'] - earliest_date).dt.days // 7) % 52

# Step 3: Aggregate the data for every week
weekly_data_kiwano_forecasted = forecast_future_date_sales.groupby('WEEK_NUMBER')['yhat'].sum().reset_index()
In [34]:
import pandas as pd

# Assuming weekly_data is your DataFrame containing 'WEEK_NUMBER' and 'UNIT_SALES' columns

# Sort the DataFrame by 'WEEK_NUMBER' if it's not already sorted
weekly_data_kiwano_forecasted = weekly_data_kiwano_forecasted.sort_values(by='WEEK_NUMBER')

# Ensure that there are enough rows to compute the rolling sum
if len(weekly_data_kiwano_forecasted) < 13:
    print("Insufficient data to compute rolling sum.")
else:
    # Initialize an empty list to store the results
    rolling_sales_sum = []

    # Iterate over the DataFrame to calculate the rolling sum for each 13-week interval
    for i in range(len(weekly_data_kiwano_forecasted) - 12):
        # Calculate the sum of UNIT_SALES for the current 13-week interval
        sales_sum = weekly_data_kiwano_forecasted.loc[i:i+12, 'yhat'].sum()
        # Construct the 13-week interval string
        interval = f"{weekly_data_kiwano_forecasted.iloc[i]['WEEK_NUMBER']}-{weekly_data_kiwano_forecasted.iloc[i+12]['WEEK_NUMBER']}"
        # Append the result to the list
        rolling_sales_sum.append((sales_sum, interval))

    # Create a new DataFrame from the results
    rolling_sales_df = pd.DataFrame(rolling_sales_sum, columns=['SALES_13W_SUM', '13_WEEK_INTERVAL'])
In [35]:
import matplotlib.pyplot as plt

# Assuming rolling_sales_df contains the DataFrame with 'SALES_13W_SUM' and '13_WEEK_INTERVAL' columns

# Extract data for plotting
sales_sum = rolling_sales_df['SALES_13W_SUM']
interval = rolling_sales_df['13_WEEK_INTERVAL']

# Plot the rolling sum
plt.figure(figsize=(18, 6))
plt.plot(interval, sales_sum, marker='o', linestyle='-')
plt.xlabel('13-Week Interval')
plt.ylabel('Sales Sum')
plt.title('Rolling Sum of Sales over 13-Week Intervals')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.grid(True)

# Find the index of the maximum value
max_index = sales_sum.idxmax()
max_value = sales_sum[max_index]
max_interval = interval[max_index]

# Annotate the peak with a red marker and the corresponding 13-week interval
plt.annotate(f'Max: {max_value} ({max_interval})',
             xy=(max_interval, max_value),
             xytext=(70, -20),
             textcoords='offset points',
             arrowprops=dict(facecolor='red', arrowstyle='->'),
             fontsize=10,
             color='red',
             ha='center')

plt.tight_layout()
plt.show()

Summary¶

  • From the above forecasted plot, it can be interpreted that the maximum sales in the next one year is from the week 24 to week 36 whereas,
  • Our historical data have highest sales from the week 19 - week 31.

  • Our forecasting demand range is similar to the range of already existing historical data demand, making our results valid and much more derivable in the future.

  • The best 13 weeks for the forecasted year is 24 - 36 and the total units sales is 186 K.

Q4 ¶

  • Item Description: Diet Square Mulberries Sparkling Water 10Small MLT
  • Caloric Segment: Diet
  • Market Category: Sparkling Water
  • Manufacturer: Swire-CC
  • Brand: Square
  • Package Type: 10Small MLT
  • Flavor: ‘Mulberries
  • Swire plans to release this product for the duration of 1 year but only in the Northern region. What will the forecasted demand be, in weeks, for this product?

This is the consideration with Utah being the central point

  • North West States: Washington (WA), Oregon (OR)
  • North States: Idaho (ID), Wyoming (WY), Washington (WA), Oregon (OR)
  • South States: Arizona (AZ), New Mexico (NM)
  • West States: Nevada (NV), California (CA)
  • Central States: Utah (UT), Colorado (CO)
  • East States: Nebraska (NE), South Dakota (SD), Kansas (KS)
In [1]:
# Function to calculate the scores

from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt
import numpy as np
def get_scores(actual, predicted):
# Calculate errors
    mae = mean_absolute_error(actual, predicted)
    mse = mean_squared_error(actual, predicted)
    rmse = sqrt(mse)
    #percentage_diff = (np.abs(actual - predicted).sum()) /( np.abs(actual).mean())
    
    # Combine actual and predicted values into one DataFrame
    mape_df = pd.DataFrame({'actual': actual.values, 'predicted': predicted.values})

# Calculate absolute percentage error
    mape_df['absolute_percentage_error'] = np.abs((mape_df['actual'] - mape_df['predicted']) / mape_df['actual']) * 100

# Calculate MAPE
    mape = mape_df['absolute_percentage_error'].mean()
    
# Calculate MAPE
    #mape = percentage_diff.mean()
# Calculate "Accuracy" Percentage
    accuracy_percentage = 100 - mape
# Print metrics
#     print(f"MAE: {mae}")
#     print(f"MSE: {mse}")
#     print(f"RMSE: {rmse}")
#     print(f"MAPE: {mape}%")
#     print(f"Direct 'Accuracy' Percentage: {accuracy_percentage}%")
    return pd.Series(data={'mae':mae, 'mse':mse, 'rmse':rmse, 'mape':mape, 'direct_accuracy':accuracy_percentage}, index=['mae', 'mse', 'rmse','mape','direct_accuracy'])
In [17]:
# Reading the zip_to_market data
zip_to_market = pd.read_csv(r"zip_to_market_unit_mapping.csv")
consumer_demographics = pd.read_csv(r"pivoted consumer data.csv")

Mapping states to region:

In [292]:
# Dictionary mapping states to regions
state_regions = {
    'WA': 'North West',
    'OR': 'North West',
    'ID': 'North',
    'WY': 'North',
    'AZ': 'South',
    'NM': 'South',
    'NV': 'West',
    'CA': 'West',
    'UT': 'Central',
    'CO': 'Central',
    'NE': 'East',
    'SD': 'East',
    'KS': 'East'
}

# Function to get region for a state
def get_region(state):
    return state_regions.get(state, 'Unknown')

# Add a new column 'Region' based on 'State'
consumer_demographics['Region'] = consumer_demographics['State'].map(get_region)

Filtering for North region:

In [293]:
# Selecting only Zip and Region column
consumer_demographics_zip_region = consumer_demographics[['Zip', 'Region']]

# Filtering the dataset to consider only North regions
consumer_demographics_north = consumer_demographics_zip_region[consumer_demographics_zip_region['Region'] == 'North']

Joining datasets and removing unncecessary columns:

In [294]:
# Perform inner join on 'Zip' and 'ZIP_CODE' columns
merged_data = pd.merge(consumer_demographics_north, zip_to_market, left_on='Zip', right_on='ZIP_CODE', how='inner')

# Remove duplicates in MARKET_KEY and Region
merged_data = merged_data.drop_duplicates(subset=['MARKET_KEY', 'Region'])

## Dropping Zip Column
merged_data.drop(columns='Zip', inplace=True)

# Dropping the Zip Code Column
merged_data.drop(columns='ZIP_CODE', inplace=True)

checking Null values in the merged dataset:

In [297]:
# Combining market_demand with zip_customer dataset
final_merged_data = pd.merge(market_demand, merged_data, on='MARKET_KEY', how='inner')

# Checking if there are null values
final_merged_data.isnull().sum().sum()
Out[297]:
0

Mulberry + Swire + Diet

In [298]:
# Filtering the mulberry flavor from item
mulberry = final_merged_data[(final_merged_data['ITEM'].str.contains('MULBERRIES', case=False, regex=True))]
mulberry_swire = mulberry[mulberry['MANUFACTURER'] == 'SWIRE-CC']
mulberry_swire_diet = mulberry_swire[mulberry_swire['CALORIC_SEGMENT']  == 'DIET/LIGHT']
# Printing the shape of the new dataset which contains only mulberry
mulberry.shape
Out[298]:
(12784, 11)
In [300]:
# Printing the shape
mulberry_swire_diet.shape
Out[300]:
(1708, 11)

There are about 1700 records with the filters applied.

In [301]:
# Convert 'DATE' column to datetime format
mulberry_swire_diet['DATE'] = pd.to_datetime(mulberry_swire_diet['DATE'])

# Grouping by date column and computing unit sales to eliminate duplicates
weekly_data = mulberry_swire_diet.groupby('DATE')['UNIT_SALES'].sum().reset_index()

# Group by year and sum the UNIT_SALES
yearly_sales = weekly_data.groupby(weekly_data['DATE'].dt.year)['UNIT_SALES'].sum()

# Create a new DataFrame from the grouped data
yearly_sales_df = pd.DataFrame({'Year': yearly_sales.index, 'Unit_Sales': yearly_sales.values})

# Printing the dataframe
yearly_sales_df
Out[301]:
Year Unit_Sales
0 2020 4187.0
1 2021 50501.0
2 2022 7165.0
3 2023 156.0

The unit sales is highest for the year 2021 with 50K followed by 2022 and 2020. The least unit sales is from the year 2023 which accounts for a mere 156 sales. There is a negative trend after the year 2021.

In [302]:
import matplotlib.pyplot as plt

# Find the peak sales year and value
peak_sales_year = yearly_sales_df.loc[yearly_sales_df['Unit_Sales'].idxmax(), 'Year']
peak_sales_value = yearly_sales_df['Unit_Sales'].max()

# Plot
plt.figure(figsize=(10, 6))
plt.plot(yearly_sales_df['Year'], yearly_sales_df['Unit_Sales'], marker='o', linestyle='-')
plt.title('Yearly Unit Sales')
plt.xlabel('Year')
plt.ylabel('Unit Sales')
plt.grid(True)
plt.xticks(yearly_sales_df['Year'])  # Show all years on x-axis

# Mark the peak sales value
plt.annotate(f'Peak: {peak_sales_value} units', xy=(peak_sales_year, peak_sales_value),
             xytext=(peak_sales_year + 1, peak_sales_value + 1000),
             arrowprops=dict(facecolor='black', shrink=0.05))

plt.tight_layout()
plt.show()

There is a negative trend for mulberry + swire + diet filtered products from 2021 onwards.

Prophet Model¶

In [303]:
# Converting  Date to datetime variable
mulberry_swire_diet['DATE'] = pd.to_datetime(mulberry_swire_diet['DATE'])

# Grouping by date to ensure only one record for each date
mulberry_swire_diet = mulberry_swire_diet.groupby('DATE')['UNIT_SALES'].sum().reset_index()

Training the entire dataset for forecasting

Swire + Mulberry + Diet¶

In [304]:
# Extracting mulberry flavor
mulberry = final_merged_data[(final_merged_data['ITEM'].str.contains('MULBERRIES', case=False, regex=True))]

# Filtering out the Swire sales
mulberry_swire = mulberry[mulberry['MANUFACTURER'] == 'SWIRE-CC']

# Filtering out swire+mulberry+diet
mulberry_swire_diet = mulberry_swire[mulberry_swire['CALORIC_SEGMENT']  == 'DIET/LIGHT']
In [305]:
# Copying the datset 
mulberry_swire_diet_forecast = mulberry_swire_diet.copy()
In [306]:
# Converting to date time
mulberry_swire_diet_forecast['DATE'] = pd.to_datetime(mulberry_swire_diet_forecast['DATE'])

# Grouping to ensure one record for each date
mulberry_swire_diet_forecast_daily = mulberry_swire_diet_forecast.groupby('DATE')['UNIT_SALES'].sum().reset_index()

# Sorting by date column
mulberry_swire_diet_forecast = mulberry_swire_diet_forecast_daily.sort_values(by='DATE')

Data prepartion for Prophet:

In [307]:
# Renaming for Prophet model requirements
mulberry_swire_diet_forecast = mulberry_swire_diet_forecast.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})

# Extracting the last date
mulberry_swire_diet_forecast.tail(1)
Out[307]:
ds y
134 2023-07-01 1.0

The last date is 2023-07-01. Building the model:

In [ ]:
# Creating an instance for the model
p_model1 = Prophet()

# Fitting the model
p_model1.fit(mulberry_swire_diet_forecast)

# Create future dates for forecasting for 1 year
future = p_model1.make_future_dataframe(periods=52,freq = 'W')

# Make predictions
forecast1 = p_model1.predict(future)

# Plot the forecast
# plot_plotly(p_model1, forecast1)
In [309]:
# Extracting the future sales for 1 year
forecast_future = forecast1[forecast1['ds'] > '2023-07-01']

# Printing to ensure if the filter is working right
forecast_future.head(2)
Out[309]:
ds trend yhat_lower yhat_upper trend_lower trend_upper additive_terms additive_terms_lower additive_terms_upper yearly yearly_lower yearly_upper multiplicative_terms multiplicative_terms_lower multiplicative_terms_upper yhat
135 2023-07-02 -110.053709 -156.706731 201.795569 -110.053709 -110.053709 135.114942 135.114942 135.114942 135.114942 135.114942 135.114942 0.0 0.0 0.0 25.061234
136 2023-07-09 -114.546123 -197.287970 165.726691 -114.818539 -114.317650 100.385351 100.385351 100.385351 100.385351 100.385351 100.385351 0.0 0.0 0.0 -14.160772
In [310]:
# Removing any negative sales
forecast_future['yhat'] = forecast_future['yhat'].clip(lower=0)
In [311]:
# Computing yearly sales
yearly_forecast_mulberry = forecast_future.groupby(forecast_future['ds'].dt.year)['yhat'].sum()

# Create a new DataFrame from the grouped data
yearly_forecast_mulberry_df = pd.DataFrame({'Year': yearly_forecast_mulberry.index, 'Unit_Sales': yearly_forecast_mulberry.values})

# Printing the data
yearly_forecast_mulberry_df
Out[311]:
Year Unit_Sales
0 2023 25.061234
1 2024 0.000000
  • Swire and Mulberry of type Diet has not crossed even 100 unit sales in the forecasted year. This could be because Swire has not experimented with this flavor or of type diet in the last year.

Considering various combinations to determine why are the sales dipping in swire + diet + mulberry.

Only Mulberry Prophet model¶

In [321]:
#### Extracing the mulberry flavor
mulberry = final_merged_data[(final_merged_data['ITEM'].str.contains('MULBERRIES', case=False, regex=True))]
mulberry_forecast = mulberry.copy()

# Converting to date time variable
mulberry_forecast['DATE'] = pd.to_datetime(mulberry_forecast['DATE'])

# No duplicate date rows
mulberry_forecast_daily = mulberry_forecast.groupby('DATE')['UNIT_SALES'].sum().reset_index()

# Sort according to the date
mulberry_forecast = mulberry_forecast_daily.sort_values(by='DATE')

# Rename for prophet requirements
mulberry_forecast = mulberry_forecast.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})

# Printing the last date 
mulberry_forecast.tail(5)
Out[321]:
ds y
143 2023-09-30 4307.0
144 2023-10-07 4330.0
145 2023-10-14 4467.0
146 2023-10-21 4254.0
147 2023-10-28 3974.0
In [ ]:
p_model2 = Prophet()
p_model2.fit(mulberry_forecast)

# Create future dates for forecasting
future = p_model2.make_future_dataframe(periods=52,freq = 'W')

# Make predictions
forecast2 = p_model2.predict(future)

# Plot the forecast
# plot_plotly(p_model2, forecast2)

Mulberry sales are better compared Mulberry + Swire + Diet

In [18]:
# Extracting the forecated sales
forecast_future = forecast2[forecast2['ds'] > '2023-10-28']
In [324]:
forecast_future.tail(1)
Out[324]:
ds trend yhat_lower yhat_upper trend_lower trend_upper additive_terms additive_terms_lower additive_terms_upper yearly yearly_lower yearly_upper multiplicative_terms multiplicative_terms_lower multiplicative_terms_upper yhat
199 2024-10-20 3778.492273 2305.154067 3752.276782 3405.329831 4128.001289 -712.177312 -712.177312 -712.177312 -712.177312 -712.177312 -712.177312 0.0 0.0 0.0 3066.314961

The trend might be decreasing but there are still postive sales in 1000's.

In [325]:
forecast_future['yhat'] = forecast_future['yhat'].clip(lower=0)

yearly_forecast_mulberry = forecast_future.groupby(forecast_future['ds'].dt.year)['yhat'].sum()

# Create a new DataFrame from the grouped data
yearly_forecast_mulberry_df = pd.DataFrame({'Year': yearly_forecast_mulberry.index, 'Unit_Sales': yearly_forecast_mulberry.values})

yearly_forecast_mulberry_df
Out[325]:
Year Unit_Sales
0 2023 44463.494491
1 2024 173997.204651

The forecasted sales for 1 year is 218,460. Mulberry (or flavor) is not the problem for low unit sales in Mulberry + Diet + Swire

Mulberry + Diet¶

In [326]:
# Extracing mulberry flavor
mulberry = final_merged_data[(final_merged_data['ITEM'].str.contains('MULBERRIES', case=False, regex=True))]

# Filtering for diet 
mulberry_diet = mulberry[mulberry['CALORIC_SEGMENT']  == 'DIET/LIGHT']
mulberry_diet_forecast = mulberry_diet.copy()

# Converting to date time
mulberry_diet_forecast['DATE'] = pd.to_datetime(mulberry_diet_forecast['DATE'])

# Ensuring no duplicate records for date
mulberry_diet_forecast_daily = mulberry_diet_forecast.groupby('DATE')['UNIT_SALES'].sum().reset_index()

# Sorting the date
mulberry_diet_forecast_daily = mulberry_diet_forecast_daily.sort_values(by='DATE')

# Renaming for prophet requiremts
mulberry_diet_forecast = mulberry_diet_forecast_daily.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})

# Printing the last date
mulberry_diet_forecast.tail(3)
Out[326]:
ds y
145 2023-10-14 4467.0
146 2023-10-21 4254.0
147 2023-10-28 3974.0
In [ ]:
# Creating an instance for the model
p_model4 = Prophet()

# Fitting the model
p_model4.fit(mulberry_diet_forecast)

# Create future dates for forecasting
future = p_model4.make_future_dataframe(periods=52,freq = 'W')

# Make predictions
forecast3 = p_model4.predict(future)

# Plot the forecast
# plot_plotly(p_model4, forecast3)

Lets forecast for the last two years:

In [328]:
# Extracting the future sales for 1 year
forecast_future = forecast3[forecast3['ds'] > '2023-10-28']
In [329]:
# Ensuring no negative unit sales
forecast_future['yhat'] = forecast_future['yhat'].clip(lower=0)

# Compute yearly sales
yearly_forecast_mulberry_diet = forecast_future.groupby(forecast_future['ds'].dt.year)['yhat'].sum()

# Create a new DataFrame from the grouped data
yearly_forecast_mulberry_diet_df = pd.DataFrame({'Year': yearly_forecast_mulberry_diet.index, 'Unit_Sales': yearly_forecast_mulberry_diet.values})
yearly_forecast_mulberry_diet_df
Out[329]:
Year Unit_Sales
0 2023 37951.326633
1 2024 178579.675365

The forecasted sales for 1 year is 216K. Mulberry (or flavor) + Diet (Caloric Segment) is not the problem for low unit sales in Mulberry + Diet + Swire. Lets check for Swire now:

Mulberry + Swire¶

In [332]:
# Extracting mulberry flavor
mulberry = final_merged_data[(final_merged_data['ITEM'].str.contains('MULBERRIES', case=False, regex=True))]

# Filtering Swire sales
mulberry_swire = mulberry[mulberry['MANUFACTURER'] == 'SWIRE-CC']
mulberry_swire_forecast = mulberry_swire.copy()

# Converting to Date Time
mulberry_swire_forecast['DATE'] = pd.to_datetime(mulberry_swire_forecast['DATE'])

# Aggregating by date
mulberry_swire_forecast_daily = mulberry_swire_forecast.groupby('DATE')['UNIT_SALES'].sum().reset_index()

# Sorting by date
mulberry_swire_forecast = mulberry_swire_forecast_daily.sort_values(by='DATE')

# Renaming the columns for Prophet requirements
mulberry_swire_forecast = mulberry_swire_forecast.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})

# Printing the last date
mulberry_swire_forecast.tail(1)
Out[332]:
ds y
134 2023-07-01 1.0
In [ ]:
# Creating an instance for the Prophet model
p_model3 = Prophet()

# Fitting the modle
p_model3.fit(mulberry_swire_forecast)

# Create future dates for forecasting
future = p_model3.make_future_dataframe(periods=52,freq = 'W')

# Make predictions
forecast2 = p_model3.predict(future)

# Plot the forecast
# plot_plotly(p_model3, forecast2)
In [334]:
# Future Sales
forecast_future = forecast2[forecast2['ds'] > '2023-07-01']

# No Negative Sales
forecast_future['yhat'] = forecast_future['yhat'].clip(lower=0)

# Year wise sales
yearly_forecast_mulberry_swire = forecast_future.groupby(forecast_future['ds'].dt.year)['yhat'].sum()

# Create a new DataFrame from the grouped data
yearly_forecast_mulberry_df = pd.DataFrame({'Year': yearly_forecast_mulberry_swire.index, 'Unit_Sales': yearly_forecast_mulberry_swire.values})
yearly_forecast_mulberry_df
Out[334]:
Year Unit_Sales
0 2023 27.580757
1 2024 0.000000
  • Swire and Mulberry has not crossed even 100 unit sales(thats the problem for low swire + mulberry + diet sales) in the forecasted year. Maybe Swire has not experimented with this flavor in the last year.
  • So will consider Diet + Mulberry + Sparkling and then take Swire sales into account.

Diet + Mulberry + Sparkling¶

In [335]:
mulberry = final_merged_data[(final_merged_data['ITEM'].str.contains('MULBERRIES', case=False, regex=True))]
mulberry_diet = mulberry[mulberry['CALORIC_SEGMENT']  == 'DIET/LIGHT']
In [336]:
mulberry_diet['CATEGORY'].value_counts()
Out[336]:
CATEGORY
SPARKLING WATER       8399
ING ENHANCED WATER    3679
SSD                     18
Name: count, dtype: int64
In [337]:
mulberry_diet_sparkling = mulberry_diet[mulberry_diet['CATEGORY']  == 'SPARKLING WATER']

Prophet Model for validation of Sparkling Water + Diet + Mulberry¶

In [349]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt
import numpy as np
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt

# Assuming mulberry_diet_sparkling_validation is your DataFrame
mulberry_diet_sparkling_validation = mulberry_diet_sparkling.copy()

# Convert Date to datetime variable
mulberry_diet_sparkling_validation['DATE'] = pd.to_datetime(mulberry_diet_sparkling_validation['DATE'])

# Group by date to ensure only one record for each date
mulberry_diet_sparkling_validation = mulberry_diet_sparkling_validation.groupby('DATE')['UNIT_SALES'].sum().reset_index()

# Sort the data by date
mulberry_diet_sparkling_validation = mulberry_diet_sparkling_validation.sort_values(by='DATE')

# Calculate the index for partitioning the data
partition_index = int(len(mulberry_diet_sparkling_validation) * 0.8)

# Split the data into train and test sets
train_data = mulberry_diet_sparkling_validation.iloc[:partition_index]
test_data = mulberry_diet_sparkling_validation.iloc[partition_index:]

# Renaming for Prophet requirements
train_data = train_data.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
test_data = test_data.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})

p_model = Prophet()
p_model.fit(train_data)

# Create future dates for forecasting
future = p_model.make_future_dataframe(periods=34, freq='W')

# Make predictions
forecast = p_model.predict(future)

# Extracting the date and unit sales from forecast
pred = forecast[['ds','yhat']]

# Plotting actual vs forecasted sales
plt.plot(train_data['ds'], train_data['y'], label='Actual Train Set', color='blue')
plt.plot(test_data['ds'], test_data['y'], label='Actual Test Set', color='green')
plt.plot(pred['ds'], pred['yhat'], label='Forecast', color='red')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.title('Actual vs Forecast')
plt.legend()
plt.xticks(rotation=45)
plt.axhline(0, color='black', linewidth=0.5)
plt.show()
22:37:56 - cmdstanpy - INFO - Chain [1] start processing
22:37:56 - cmdstanpy - INFO - Chain [1] done processing

Model appears to be doing a fine job. Checking error metrics:

In [400]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt
import numpy as np
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt

# Assuming mulberry_diet_sparkling_validation is your DataFrame
mulberry_diet_sparkling_validation = mulberry_diet_sparkling.copy()

# Convert Date to datetime variable
mulberry_diet_sparkling_validation['DATE'] = pd.to_datetime(mulberry_diet_sparkling_validation['DATE'])

# Group by date to ensure only one record for each date
mulberry_diet_sparkling_validation = mulberry_diet_sparkling_validation.groupby('DATE')['UNIT_SALES'].sum().reset_index()

# Sort the data by date
mulberry_diet_sparkling_validation = mulberry_diet_sparkling_validation.sort_values(by='DATE')

# Calculate the index for partitioning the data
partition_index = int(len(mulberry_diet_sparkling_validation) * 0.8)

# Split the data into train and test sets
train_data = mulberry_diet_sparkling_validation.iloc[:partition_index]
test_data = mulberry_diet_sparkling_validation.iloc[partition_index:]

# Renaming for Prophet requirements
train_data = train_data.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
test_data = test_data.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})

p_model = Prophet()
p_model.fit(train_data)

# Create future dates for forecasting
future = p_model.make_future_dataframe(periods=0, freq='W')

# Make predictions
forecast = p_model.predict(future)

# Extracting the date and unit sales from forecast
pred = forecast[['ds','yhat']]

get_scores(train_data['y'], pred['yhat'])
23:14:15 - cmdstanpy - INFO - Chain [1] start processing
23:14:15 - cmdstanpy - INFO - Chain [1] done processing
Out[400]:
mae                  226.692952
mse                94751.882440
rmse                 307.817937
mape                   3.956043
direct_accuracy       96.043957
dtype: float64
In [399]:
forecast = p_model.predict(test_data)

pred = forecast[['ds','yhat']]
get_scores(test_data['y'], pred['yhat'])
Out[399]:
mae                   654.519767
mse                618493.972874
rmse                  786.443878
mape                   13.653096
direct_accuracy        86.346904
dtype: float64

The MAPE error is 13 %. Model is working fine no overfitting. Next is to forecast where we train the entire dataset and forescast for the next one year.

Training on entire dataset for forecasting sales for the next one year¶

In [340]:
mulberry_diet_sparkling['DATE'] = pd.to_datetime(mulberry_diet_sparkling['DATE'])

# Ensuring no duplicate records for date
mulberry_diet_sparkling_daily = mulberry_diet_sparkling.groupby('DATE')['UNIT_SALES'].sum().reset_index()

# Sorting the date
mulberry_diet_sparkling_daily = mulberry_diet_sparkling_daily.sort_values(by='DATE')

# Renaming for prophet requiremts
mulberry_diet_sparkling_daily = mulberry_diet_sparkling_daily.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})

# Printing the last date
mulberry_diet_sparkling_daily.tail(1)
Out[340]:
ds y
147 2023-10-28 3896.0
In [ ]:
# Creating an instance for the model
p_model5 = Prophet()

# Fitting the model
p_model5.fit(mulberry_diet_sparkling_daily)

# Create future dates for forecasting
future5 = p_model5.make_future_dataframe(periods=52,freq = 'W')

# Make predictions
forecast5 = p_model5.predict(future5)

# Plot the forecast
plot_plotly(p_model5, forecast5)

image.png

In [342]:
# Future Sales
forecast5_future = forecast5[forecast5['ds'] > '2023-10-28']
forecast5_future
# # No Negative Sales
forecast5_future['yhat'] = forecast5_future['yhat'].clip(lower=0)

# # Year wise sales
yearly_forecast_mulberry_diet_sparkling = forecast5_future.groupby(forecast5_future['ds'].dt.year)['yhat'].sum()

# # Create a new DataFrame from the grouped data
yearly_forecast_mulberry_sparking_diet_df = pd.DataFrame({'Year': yearly_forecast_mulberry_diet_sparkling.index, 'Unit_Sales': yearly_forecast_mulberry_diet_sparkling.values})
yearly_forecast_mulberry_sparking_diet_df
Out[342]:
Year Unit_Sales
0 2023 38484.278250
1 2024 184932.248948

From the above table, last two months in 2023 plus the remaining months in 2024 gives a forecasted sale for 1 year of 223K considering diet + mulberry + sparkling water.

Accounting for Swire:

In [343]:
mulberry = final_merged_data[(final_merged_data['ITEM'].str.contains('MULBERRIES', case=False, regex=True))]
mulberry_diet = mulberry[mulberry['CALORIC_SEGMENT']  == 'DIET/LIGHT']
mulberry_diet_sparkling = mulberry_diet[mulberry_diet['CATEGORY']  == 'SPARKLING WATER']
In [344]:
mulberry_diet_sparkling['MANUFACTURER'].value_counts(normalize=True)
Out[344]:
MANUFACTURER
JOLLYS      0.571735
COCOS       0.151328
BEARS       0.141207
SWIRE-CC    0.135730
Name: proportion, dtype: float64

Swire constitutes 13% of unit sales compared to other brands in the filtered data. Hence, we will only consider 13% of the obtained forecast. Therefore, 13% of 223K computes to above 29K unit sales.

Q4 Summary¶

  • Observed dipping sales when Swire is taken into consideration. Swire has reduced significantly the sale of Mulberry flavor.
  • Considered Diet and Sparkling Water and Mulberry which has a better sale of 223K in the forecasting year.
  • Swire has 13 % of these sales, hence the total unit sales in the forecasted year is 29K.

Q6 ¶

Item Description: Diet Energy Moonlit Casava 2L Multi Jug

  • Caloric Segment: Diet
  • Market Category: Energy
  • Manufacturer: Swire-CC
  • Brand: Diet Moonlit
  • Package Type: 2L Multi Jug
  • Flavor: ‘Cassava’

Swire plans to release this product for 6 months. What will the forecasted demand be, in weeks, for this product? roduct?

Data Preparation¶

Filtering the dataset:

In [3]:
multi_jug_2l = market_demand[market_demand['PACKAGE']=='2L MULTI JUG']
diet_multi_jug_2l = multi_jug_2l[multi_jug_2l['CALORIC_SEGMENT']=='DIET/LIGHT']
swire_diet_multi_jug_2l = diet_multi_jug_2l[diet_multi_jug_2l['MANUFACTURER']=='SWIRE-CC']
swire_diet_multi_jug_2l_moonlit = swire_diet_multi_jug_2l[swire_diet_multi_jug_2l['BRAND'].str.contains('DIET MOONLIT', case=False, regex=True)]
In [4]:
df_6 = swire_diet_multi_jug_2l_moonlit[['DATE','UNIT_SALES']].groupby(by="DATE", as_index=False).sum()
df_6['DATE'] = pd.to_datetime(df_6['DATE'])
df_6.set_index('DATE', inplace=True)
df_6 = df_6.asfreq('W-SAT')
In [5]:
plt.plot(df_6)
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
Out[5]:
Text(0, 0.5, 'UNIT_SALES')

There are NAN in some weeks. as we dig deeper we see that there are no sale on the month of august 2023.

In [6]:
# Train and Test periods
from datetime import datetime

s_date = '2023-02-25'
e_date = '2023-08-26'
df_6 = df_6.loc[df_6.index <= e_date]
train_period = df_6.loc[df_6.index < s_date]
test_period = df_6.loc[(df_6.index >= s_date) & (df_6.index <= e_date)]


start_forecast = datetime.strptime(s_date, '%Y-%m-%d').date()
end_forecast = datetime.strptime(e_date, '%Y-%m-%d').date()

Model Performance Metrics Function¶

We will create and resuse this function to measure each models performance with the data. This will include MAE, MSE, RMSE, MAPE and direct accuracy percentage.

In [7]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt
import numpy as np

def get_scores(actual, predicted):
    # Calculate errors
    mae = mean_absolute_error(actual, predicted)
    mse = mean_squared_error(actual, predicted)
    rmse = sqrt(mse)
    percentage_diff = np.abs((actual - predicted) / actual) * 100

    # Calculate MAPE
    mape = percentage_diff.mean()

    # Calculate "Accuracy" Percentage
    accuracy_percentage = 100 - mape

    # Print metrics
    print(f"MAE: {mae}")
    print(f"MSE: {mse}")
    print(f"RMSE: {rmse}")
    print(f"MAPE: {mape}%")
    print(f"Direct 'Accuracy' Percentage: {accuracy_percentage}%")

    return pd.Series(data={'mae':mae, 'mse':mse, 'rmse':rmse, 'mape':mape, 'direct_accuracy':accuracy_percentage}, index=['mae', 'mse', 'rmse','mape','direct_accuracy'])

ARIMA¶

ARIMA is a statistical model used for time series analysis to forecast future data points by leveraging past data. It combines three main aspects: autoregression (AR), differencing (I) to make the time series stationary, and moving average (MA). The AR part exploits the relationship between an observation and a number of lagged observations, the I part involves differencing the data to achieve stationarity, and the MA part models the error of the observation as a combination of previous error terms.

In [8]:
from statsmodels.tsa.arima.model import ARIMA

# ARIMA Model fitting
order = (1, 1, 1)  # Still (p, d, q)

model = ARIMA(train_period['UNIT_SALES'], order=order)
model_fit = model.fit()

# Forecasting
forecast = model_fit.predict(start=start_forecast,
                             end=end_forecast)

arima_forecast = forecast
In [9]:
arima_score = get_scores(test_period.squeeze(), arima_forecast)
MAE: 1331.9101181487447
MSE: 2785766.362578469
RMSE: 1669.0615215079608
MAPE: 7.579412439517313%
Direct 'Accuracy' Percentage: 92.42058756048269%
In [10]:
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(test_period['UNIT_SALES'], label='Actual Sales')
plt.plot(arima_forecast, label='Forecast', color='red')
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()

SARIMA¶

SARIMA extends ARIMA by explicitly accommodating and modeling seasonal effects in time series data. It includes additional seasonal elements on top of the AR, I, and MA components. SARIMA is characterized by its ability to model both non-seasonal and seasonal components of the time series data, making it more versatile than ARIMA for data with clear seasonal patterns, such as sales data around specific holidays or events. It incorporates additional parameters to handle seasonality, which are seasonal AR, seasonal differencing, and seasonal MA components, allowing it to capture seasonal fluctuations effectively, making it ideal for products with seasonal demand.

In [19]:
from statsmodels.tsa.statespace.sarimax import SARIMAX


# SARIMA Model fitting
order = (1, 1, 1)  # (p, d, q)
seasonal_order = (1, 1, 1, 52)  # (P, D, Q, s)

model = SARIMAX(train_period['UNIT_SALES'], order=order, seasonal_order=seasonal_order)
model_fit = model.fit()

# Forecasting
forecast = model_fit.predict(start=start_forecast.strftime('%Y-%m-%d'),
                             end=end_forecast.strftime('%Y-%m-%d'))
In [12]:
sarima_score = get_scores(test_period.squeeze(), sarima_forecast)
MAE: 2078.657764078491
MSE: 5833349.344575272
RMSE: 2415.2327723379526
MAPE: 12.472789935052829%
Direct 'Accuracy' Percentage: 87.52721006494717%
In [13]:
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(test_period['UNIT_SALES'], label='Actual Sales')
plt.plot(sarima_forecast, label='Forecast', color='red')
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()

forecast looks good when compared to the actual sales.

Prophet¶

Prophet is a forecasting tool designed by Facebook for handling time series data that displays patterns on different time scales such as yearly, weekly, and daily. It is especially useful for data with strong seasonal effects and several seasons of historical data. Prophet works by fitting nonlinear trends with yearly, weekly, and daily seasonality, plus holiday effects. It is robust to missing data and shifts in the trend, and typically requires no manual tuning of parameters. The model accommodates seasonality through Fourier series and includes components for holidays and special events, making it well-suited for predicting demand for products around specific events or holidays, like Easter.

In [20]:
from prophet import Prophet

df_mod = train_period.reset_index()
df_mod['DATE'] = pd.to_datetime(df_mod['DATE'])
df_mod.columns = ['ds', 'y']

model = Prophet(changepoint_prior_scale=.1,seasonality_prior_scale=.1, weekly_seasonality=True)
model.fit(df_mod)

# Forecast the next 52 weeks (1 year)
future = model.make_future_dataframe(periods=53, freq='W-SAT')
forecast = model.predict(future)

prophet_forecast = forecast[['ds','yhat']]
prophet_forecast['ds'] = pd.to_datetime(prophet_forecast['ds'])
prophet_forecast.columns = ['DATE', 'yhat']
prophet_forecast.set_index('DATE', inplace=True)
prophet_forecast = prophet_forecast.asfreq('W-SAT')
prophet_forecast = prophet_forecast.loc[(prophet_forecast.index >= s_date) & (prophet_forecast.index <= e_date)]
In [15]:
prophet_score = get_scores(test_period.squeeze(), prophet_forecast.loc[(prophet_forecast.index >= s_date) & (prophet_forecast.index <= e_date)].squeeze())
MAE: 2573.16930571512
MSE: 9358981.759745052
RMSE: 3059.2452925100747
MAPE: 15.66648137188985%
Direct 'Accuracy' Percentage: 84.33351862811016%

Plotting the actual vs forecast graph:

In [16]:
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(test_period['UNIT_SALES'], label='Actual Sales')
plt.plot(prophet_forecast, label='Forecast', color='red')
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()

Exponential Smoothing¶

Exponential Smoothing is a time series forecasting method for univariate data that applies smoothing factors to the observations, giving more weight to recent observations while not discarding older observations entirely. It encompasses simple exponential smoothing for data with no clear trend or seasonality, and extends to Holt’s linear trend model and Holt-Winters’ seasonal model, which can account for both trends and seasonality in the data. This method is straightforward and computationally efficient, making it a good choice for producing quick forecasts in situations where data patterns are reasonably consistent over time, but may struggle with data that has complex patterns or significant irregularities.

In [21]:
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# ExponentialSmoothing Model fitting
model = ExponentialSmoothing(train_period['UNIT_SALES'], trend='add', seasonal='add', seasonal_periods=52).fit(smoothing_level=0.01, smoothing_slope=0.01, smoothing_seasonal=0.1)

# Forecasting

forecast_periods = ((end_forecast - start_forecast).days // 7) + 1 

exponential_forecast = model.forecast(forecast_periods)

forecast_dates = pd.date_range(start=start_forecast, periods=forecast_periods, freq='W-SAT')
In [18]:
exponential_smoothing_score = get_scores(test_period.squeeze(), exponential_forecast)
MAE: 1672.5828724262556
MSE: 4512195.604211318
RMSE: 2124.192930082227
MAPE: 9.544250972143143%
Direct 'Accuracy' Percentage: 90.45574902785685%
In [19]:
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(test_period['UNIT_SALES'], label='Actual Sales')
plt.plot(exponential_forecast, label='Forecast', color='red')
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()

Performance Summary¶

In [20]:
pd.options.display.float_format = '{:.2f}'.format
q2_scores = pd.DataFrame({'Arima':arima_score, 'Sarima':sarima_score, 'Prophet':prophet_score, 'Exponential Smoothing':exponential_smoothing_score}).T
print(q2_scores)
                          mae        mse    rmse  mape  direct_accuracy
Arima                 1331.91 2785766.36 1669.06  7.58            92.42
Sarima                2078.66 5833349.34 2415.23 12.47            87.53
Prophet               2573.17 9358981.76 3059.25 15.67            84.33
Exponential Smoothing 1672.58 4512195.60 2124.19  9.54            90.46
In [21]:
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(18, 10))
fig.suptitle('Model Performance Metrics')

for i, column in enumerate(q2_scores.columns):
    row, col = divmod(i, 3)
    sns.barplot(ax=axes[row, col], x=q2_scores.index, y=q2_scores[column])
    axes[row, col].set_title(column)
    axes[row, col].set_ylabel('')

# Adjust layout
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

Based on the metrics provided, Exponential Smoothing is identified as the most suitable choice when considering a graphical similarity to the ARIMA model. It presents a good balance between accuracy and consistency in predictions, despite not having the lowest MAE, MSE, RMSE, or the highest Direct Accuracy. Specifically, its MAE of 1672.58 and MSE of 4512195.60, along with an RMSE of 2124.19, indicate a reasonable level of prediction error, which is notably closer to the performance metrics of ARIMA. Moreover, its MAPE value of 9.54% demonstrates a relatively low percentage error, indicating its efficiency in maintaining accuracy in forecasts. Although its Direct Accuracy of 90.46% is not the highest, it reflects a strong capability in accurately forecasting future values, especially when considering the importance of graphical similarity in the context of ARIMA comparisons. Thus, for ensuring a balance between graphical resemblance to ARIMA and maintaining a good level of forecast accuracy and reliability, Exponential Smoothing emerges as the optimal model.

Predicting 6 month using Exponential Smoothing¶

In [22]:
# grouping by date
df_6 = swire_diet_multi_jug_2l_moonlit[['DATE','UNIT_SALES']].groupby(by="DATE", as_index=False).sum()
df_6['DATE'] = pd.to_datetime(df_6['DATE'])
df_6.set_index('DATE', inplace=True)
df_6 = df_6.loc[df_6.index <= e_date]
df_6 = df_6.asfreq('W-SAT')
In [23]:
start_forecast = datetime.strptime("2024-09-02", '%Y-%m-%d').date()
In [22]:
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# ExponentialSmoothing Model fitting
model = ExponentialSmoothing(df_6['UNIT_SALES'], trend='add', seasonal='add', seasonal_periods=52).fit(smoothing_level=0.01, smoothing_slope=0.01, smoothing_seasonal=0.1)

# Forecasting

forecast_periods = 26 

exponential_forecast = model.forecast(forecast_periods)

forecast_dates = pd.date_range(start=start_forecast, periods=forecast_periods, freq='W-SAT')
In [25]:
# dispalying the 6month forecast
exponential_forecast
Out[25]:
2023-09-02   16170.76
2023-09-09   15186.27
2023-09-16   15677.02
2023-09-23   16999.64
2023-09-30   16727.67
2023-10-07   15485.82
2023-10-14   15160.90
2023-10-21   16290.53
2023-10-28   15278.34
2023-11-04   15843.64
2023-11-11   15831.58
2023-11-18   15807.30
2023-11-25   17151.41
2023-12-02   16644.21
2023-12-09   17270.94
2023-12-16   16978.95
2023-12-23   18534.64
2023-12-30   17596.79
2024-01-06   17048.72
2024-01-13   15438.83
2024-01-20   15804.36
2024-01-27   15687.73
2024-02-03   16612.23
2024-02-10   16418.08
2024-02-17   15646.95
2024-02-24   16181.15
Freq: W-SAT, dtype: float64

visualizing the sales in a plot:

In [26]:
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(exponential_forecast, label='Forecast', color='red')
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()

Summary¶

The forecast for Swire's upcoming product, Diet Energy Moonlit Cassava 2L Multi Jug, was meticulously developed using a dataset tailored to match the product's characteristics: Diet caloric segment, Energy market category, manufactured by Swire-CC, and focusing on the 2L Multi Jug package type. The data was further refined to concentrate on historical sales data that aligns closely with the anticipated product profile, particularly emphasizing the Cassava flavor under the Diet Moonlit brand.

Four forecasting methodologies were evaluated for their effectiveness in predicting the product's demand over a 6-month period leading up to its launch: ARIMA, SARIMA, Prophet, and Exponential Smoothing. The assessment of these models was grounded in a comprehensive analysis using several key performance indicators: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and Direct Accuracy Percentatory phase.24.

Model Performance Summary¶

Exponential Smoothing was identified as the most apt model for this forecasting task, demonstrating superior performance across multiple metrics. Despite not achieving the lowest MAE and MSE scores in absolute terms, its performance was commendably close to that of the ARIMA model, with a MAE of 1672.58, MSE of 4512195.60, and RMSE of 2124.19. Its MAPE of 9.54% and a Direct Accuracy Percentage of 90.46% underscored its robustness and consistency in prediction accuracy, making it exceptionally well-suited for the task at hand. While the SARIMA and Prophet models offered valuable insights, they did not surpass the Exponential Smoothing model in overall efficacy. The ARIMA model, known for its reliability, fell slightly behind in this specific context.

The chosen Exponential Smoothing model then provided a forecast for the new Cassava-flavored beverage, projecting weekly sales for the next 6 months. This projection anticipates the product's market performance, guiding Swire in strategic planning for production, distribution, and promotional efforts. The model forecasts an initial weekly sales figure of 16,170.76 units, with variations over the 6-month period, peaking at certain points due to seasonal adjustments and market responses.

Findings and Conclusion¶

The Exponential Smoothing model stands out for its accuracy and reliability in forecasting sales for Swire's new Diet Energy Moonlit Cassava 2L Multi Jug. It offers a nuanced understanding of expected demand, capturing both the general trend and the seasonal fluctuations. These insights are crucial for Swire's preparation and strategy formulation, ensuring that production levels are optimally aligned with market demand, and marketing initiatives are effectively timed to maximize sales potential over the product's introductory phase.

Q5

  • Item Description: Greetingle Health Beverage Woodsy Yellow .5L 12One Jug
  • Caloric Segment: Regular
  • Market Category: ING Enhanced Water
  • Manufacturer: Swire-CC
  • Brand: Greetingle
  • Package Type: .5L 12One Jug
  • Flavor: ‘Woodsy Yellow’

Swire plans to release this product for 13 weeks, but only in one region. Which region would it perform best in?

Look at region wise distribution for two parameter - manufacturer & brand:

  • With Swire + Greetingle set we only have ING ENHANCED WATER category which matches with the category, however Caloric Segment available is only Diet. Let's break down Regular type seperately later.
  • Now with manufacturer + Brand + Category, we observe the below distribution of total unit sales:

image.png

With regards to package type ".5L 12One Jug", above 3 (manufacturer + Brand + Category) filters set, we see that there are about 5 package types the beverages were sold. Over 90% of the sales have come from "18SMALL MULTI JUG" type.

Looks like package type ".5L 12One Jug" planned to be used for this innovative product were not used much by Greetingle & Swire and would not have any impact on identifying best region keeping technical & domain knowledge in mind and hence, we will neglect package type from further analysis.

On applying flavor ‘Woodsy Yellow’ + Regular + ING ENHANCED WATER, we see only COCOS manufacturer has experimented with this combination. Let's gauge the region wise sales:

image.png

Lets compare sales by region for Woodsy Yellow, overall sales vs ENH water vs Regular:

image.png

From all the analysis uptil this point, we have seen that Central, North West and South regions have consistently topped the charts be it Brand/Manufacturer/Caloric Segment/Flavor/Categor and even for a combination of these.

Lets deep dive into these 3 regions by time. Since we don't have to forecast or look at the absolute figures of sales, ideal chart in this case would be a ribbon chart over a regular line chart:

image.png

For Flavor alone: Central with little competition from South

Flavor + Regular: Central clear winner

Flavor + Regular + ENH Water: NW clear winner

Now for brand, brand + ENH Water:

image.png

For Brand alone : Central winner for the most part

Brand + ENH Water: Central again

From the previous two results, we can safely eliminate South region as it has some significant presence for flavor alone but when other parameter are filtered on top of it, then the sales dies rapidly compared to NW & Central regions.

Top Insights:

  1. Greetingle brand has a strong presence in Utah & Colorado states.
  2. Flavor Woodsy Yellow dominates overall in South but with Regular and ENH Water filters, it is Washington & Oregon states that leads.

Central vs NW Statistical Analysis:

In [240]:
consumer_demographics_zip_region = consumer_demographics[['Zip', 'State','REGION']]

consumer_demographics_region = consumer_demographics_zip_region[(consumer_demographics_zip_region['REGION'] == 'Central') | (consumer_demographics_zip_region['REGION'] == 'North West') ]
consumer_demographics_region['REGION'].value_counts()
Out[240]:
REGION
North West    990
Central       800
Name: count, dtype: int64
In [241]:
# (consumer_demographics_region['Zip'].value_counts()>1).sum()
In [242]:
zip_cons = pd.merge(consumer_demographics_region, zip_to_market, left_on='Zip', right_on='ZIP_CODE', how='inner')
zip_cons = zip_cons.drop_duplicates(subset=['MARKET_KEY', 'REGION'])
In [243]:
merged_data = pd.merge(market_demand, zip_cons, on='MARKET_KEY', how='inner')
In [244]:
merged_data = merged_data.drop(['Zip','ZIP_CODE'], axis=1)
In [245]:
merged_data['State'].value_counts(normalize=True)
Out[245]:
State
CO    0.381271
WA    0.263675
OR    0.228770
UT    0.126284
Name: proportion, dtype: float64
In [246]:
# Group by 'State' and calculate summary statistics for 'UNIT_SALES'
unit_sales_summary_by_state = merged_data.groupby('State')['UNIT_SALES'].describe()

unit_sales_summary_by_state
Out[246]:
count mean std min 25% 50% 75% max
State
CO 4409784.0 158.650056 426.420239 0.04 11.0 43.0 133.0 11219.0
OR 2645953.0 156.398806 356.790827 0.04 13.0 49.0 144.0 11100.0
UT 1460597.0 196.563396 471.057990 0.04 13.0 51.0 167.0 14773.0
WA 3049667.0 145.675230 331.236478 0.04 12.0 48.0 138.0 9243.0

While Utah has the least number of sales count, it has the highest average sales per order. The max values suggests strong possibility of outliers or very large orders.

Note: Since, it was informed that the dataset is cleaned, we will not analyze further and remove those records.

In [247]:
print("unit sales:")
region_sales_stats = merged_data.groupby('REGION')['UNIT_SALES'].describe()
print(region_sales_stats)

# Group the data by 'REGION' and compute summary statistics for 'DOLLAR_SALES'
region_sales_summary = merged_data.groupby('REGION')['DOLLAR_SALES'].describe()

print("dollar sales:")
# Display the summary statistics
print(region_sales_summary)
unit sales:
                count        mean         std   min   25%   50%    75%      max
REGION                                                                         
Central     5870381.0  168.083194  438.258413  0.04  12.0  45.0  140.0  14773.0
North West  5695620.0  150.656967  343.386241  0.04  13.0  49.0  141.0  11100.0
dollar sales:
                count        mean          std   min    25%     50%     75%       max
REGION                                                                               
Central     5870381.0  575.904097  1464.522706  0.01  40.16  156.31  500.48  69472.26
North West  5695620.0  487.903103  1122.230124  0.01  44.68  159.52  458.65  46137.39

No significant difference for the unit sales, however Central region has more dollar sales worth of beverages sold over NW in the overall dataset.

In [248]:
merged_data['DATE'] = pd.to_datetime(merged_data['DATE'])

earliest_date = merged_data['DATE'].min()

# Step 2: Calculate week numbers
merged_data['WEEK_NUMBER'] = ((merged_data['DATE'] - earliest_date).dt.days // 7) % 52

# Filter data for Central region
central = merged_data[merged_data['REGION'] == 'Central']

# Filter data for Northwest region
northwest = merged_data[merged_data['REGION'] == 'North West']

Now the plan of action here is similar to how we had two batch of filters applied for the visuals at top for this question, we will go ahead and filter the data accordingly and then build a prophet model to compare best 13 weeks sales region wise. This will help us arrive at the best region with statistical analysis support.

Greetingle + ENH Water Filtered Modelling:

In [249]:
# Step 3: Aggregate the data for every week

northwest_fil_1 = northwest[(northwest['BRAND']=='GREETINGLE') & (northwest['CATEGORY']=='ING ENHANCED WATER')]
northwest_fil_1 = northwest_fil_1.groupby('DATE')['UNIT_SALES'].sum().reset_index()
# central_fil_1 = northwest[(northwest['BRAND']=='GREETINGLE') & (northwest['CATEGORY']=='ING ENHANCED WATER')]
# weekly_data = northwest_fil_1.groupby('WEEK_NUMBER')['UNIT_SALES'].sum().reset_index()
northwest_fil_1.tail(1)
Out[249]:
DATE UNIT_SALES
147 2023-10-28 19875.67

The last date is 2023-10-28

In [ ]:
from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly

# Initialize and fit the Prophet model
p_model = Prophet()

nw_model_data1 = northwest_fil_1.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})

p_model.fit(nw_model_data1)

# Create future dates for forecasting
future = p_model.make_future_dataframe(periods=52, freq='W')

# Make predictions
forecast = p_model.predict(future)

# Plot the forecast
plot_plotly(p_model, forecast)

image.png

Observing a negative trend. Filtering the forecast of last 1 year:

In [251]:
# week wise sales for best 13 weeks
nw_fil1_forecast_1year = forecast[forecast['ds'] > '2023-10-28']
earliest_date = nw_fil1_forecast_1year['ds'].min()
nw_fil1_forecast_1year

# Step 2: Calculate week numbers
nw_fil1_forecast_1year['WEEK_NUMBER'] = ((nw_fil1_forecast_1year['ds'] - earliest_date).dt.days // 7) % 52
weekly_forecast_data = nw_fil1_forecast_1year.groupby('WEEK_NUMBER')['yhat'].sum().reset_index()

Taking rolling sum 13 week aggregate wise:

In [252]:
# Ensure that there are enough rows to compute the rolling sum
if len(weekly_forecast_data) < 13:
    print("Insufficient data to compute rolling sum.")
else:
    # Initialize an empty list to store the results
    rolling_sales_sum = []

    # Iterate over the DataFrame to calculate the rolling sum for each 13-week interval
    for i in range(len(weekly_forecast_data) - 12):
        # Calculate the sum of UNIT_SALES for the current 13-week interval
        sales_sum = weekly_forecast_data.loc[i:i+12, 'yhat'].sum()
        # Construct the 13-week interval string
        interval = f"{weekly_forecast_data.iloc[i]['WEEK_NUMBER']}-{weekly_forecast_data.iloc[i+12]['WEEK_NUMBER']}"
        # Append the result to the list
        rolling_sales_sum.append((sales_sum, interval))

    # Create a new DataFrame from the results
    rolling_sales_df = pd.DataFrame(rolling_sales_sum, columns=['SALES_13W_SUM', '13_WEEK_INTERVAL'])
In [23]:
plotting:
  Cell In[23], line 1
    plotting:
             ^
SyntaxError: invalid syntax
In [253]:
import matplotlib.pyplot as plt

# Assuming rolling_sales_df contains the DataFrame with 'SALES_13W_SUM' and '13_WEEK_INTERVAL' columns

# Extract data for plotting
sales_sum = rolling_sales_df['SALES_13W_SUM']
interval = rolling_sales_df['13_WEEK_INTERVAL']

# Plot the rolling sum
plt.figure(figsize=(18, 6))
plt.plot(interval, sales_sum, marker='o', linestyle='-')
plt.xlabel('13-Week Interval')
plt.ylabel('Sales Sum')
plt.title('Rolling Sum of Sales over 13-Week Intervals')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.grid(True)

# Find the index of the maximum value
max_index = sales_sum.idxmax()
max_value = sales_sum[max_index]
max_interval = interval[max_index]

# Annotate the peak with a red marker and the corresponding 13-week interval
plt.annotate(f'Max: {max_value} ({max_interval})',
             xy=(max_interval, max_value),
             xytext=(70, -20),
             textcoords='offset points',
             arrowprops=dict(facecolor='red', arrowstyle='->'),
             fontsize=10,
             color='red',
             ha='center')

plt.tight_layout()
plt.show()

Best 13 weeks obtained is 29-41 with a sale figure of 300k.

In [254]:
total_fil1_nw = weekly_forecast_data['yhat'].sum()
total_13_week_sum = rolling_sales_df[rolling_sales_df['13_WEEK_INTERVAL']== '29.0-41.0'].sum()['SALES_13W_SUM']
percentage = total_13_week_sum*100/total_fil1_nw
print(round(percentage,2))
30.62

Best 13 weeks accounts for 30.62% of the 1 year sales. Lets do the same for central region now:

In [255]:
central_fil_1 = central[(central['BRAND']=='GREETINGLE') & (central['CATEGORY']=='ING ENHANCED WATER')]
central_fil_1 = central_fil_1.groupby('DATE')['UNIT_SALES'].sum().reset_index()
central_fil_1.tail(1)
Out[255]:
DATE UNIT_SALES
147 2023-10-28 21070.0

The last date is 2023-10-28

In [ ]:
from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly

# Initialize and fit the Prophet model
p_model_1 = Prophet()

cw_model_data1 = central_fil_1.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})

p_model_1.fit(cw_model_data1)

# Create future dates for forecasting
future_1 = p_model_1.make_future_dataframe(periods=52, freq='W')

# Make predictions
forecast_1 = p_model_1.predict(future_1)

# Plot the forecast
plot_plotly(p_model_1, forecast_1)

image.png

In [257]:
# week wise sales for best 13 weeks
cw_fil1_forecast_1year = forecast_1[forecast_1['ds'] > '2023-10-28']
earliest_date_1 = cw_fil1_forecast_1year['ds'].min()

# Step 2: Calculate week numbers
cw_fil1_forecast_1year['WEEK_NUMBER'] = ((cw_fil1_forecast_1year['ds'] - earliest_date_1).dt.days // 7) % 52
weekly_forecast_data_1 = cw_fil1_forecast_1year.groupby('WEEK_NUMBER')['yhat'].sum().reset_index()
In [258]:
# Ensure that there are enough rows to compute the rolling sum
if len(weekly_forecast_data_1) < 13:
    print("Insufficient data to compute rolling sum.")
else:
    # Initialize an empty list to store the results
    rolling_sales_sum = []

    # Iterate over the DataFrame to calculate the rolling sum for each 13-week interval
    for i in range(len(weekly_forecast_data_1) - 12):
        # Calculate the sum of UNIT_SALES for the current 13-week interval
        sales_sum = weekly_forecast_data_1.loc[i:i+12, 'yhat'].sum()
        # Construct the 13-week interval string
        interval = f"{weekly_forecast_data_1.iloc[i]['WEEK_NUMBER']}-{weekly_forecast_data_1.iloc[i+12]['WEEK_NUMBER']}"
        # Append the result to the list
        rolling_sales_sum.append((sales_sum, interval))

    # Create a new DataFrame from the results
    rolling_sales_df = pd.DataFrame(rolling_sales_sum, columns=['SALES_13W_SUM', '13_WEEK_INTERVAL'])
In [259]:
import matplotlib.pyplot as plt

# Assuming rolling_sales_df contains the DataFrame with 'SALES_13W_SUM' and '13_WEEK_INTERVAL' columns

# Extract data for plotting
sales_sum = rolling_sales_df['SALES_13W_SUM']
interval = rolling_sales_df['13_WEEK_INTERVAL']

# Plot the rolling sum
plt.figure(figsize=(18, 6))
plt.plot(interval, sales_sum, marker='o', linestyle='-')
plt.xlabel('13-Week Interval')
plt.ylabel('Sales Sum')
plt.title('Rolling Sum of Sales over 13-Week Intervals')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.grid(True)

# Find the index of the maximum value
max_index = sales_sum.idxmax()
max_value = sales_sum[max_index]
max_interval = interval[max_index]

# Annotate the peak with a red marker and the corresponding 13-week interval
plt.annotate(f'Max: {max_value} ({max_interval})',
             xy=(max_interval, max_value),
             xytext=(70, -20),
             textcoords='offset points',
             arrowprops=dict(facecolor='red', arrowstyle='->'),
             fontsize=10,
             color='red',
             ha='center')

plt.tight_layout()
plt.show()
In [260]:
total_fil1_central = weekly_forecast_data_1['yhat'].sum()
total_13_week_sum_central = rolling_sales_df[rolling_sales_df['13_WEEK_INTERVAL']== '25.0-37.0'].sum()['SALES_13W_SUM']
percentage = total_13_week_sum_central*100/total_fil1_central
print(round(percentage,2))
31.24

Best 13 weeks accounts for 31.24% of the 1 year sales.

Woodsy Yellow + Regular + Enh Water Filter Modelling¶

In [261]:
northwest = merged_data[merged_data['REGION'] == 'North West']
woodsy_yellow = northwest[(northwest['ITEM'].str.contains('WOODSY  YELLOW ', case=False, regex=True))]
woodsy_yellow_regular = woodsy_yellow[woodsy_yellow['CALORIC_SEGMENT'] == 'REGULAR']
woodsy_yellow_regular_enhanced_water = woodsy_yellow_regular[woodsy_yellow_regular['CATEGORY'] == 'ING ENHANCED WATER']
In [262]:
woodsy_yellow_regular_enhanced_water['MANUFACTURER'].value_counts()
Out[262]:
MANUFACTURER
COCOS    12285
Name: count, dtype: int64
In [263]:
northwest_woodsy = woodsy_yellow_regular_enhanced_water.groupby('DATE')['UNIT_SALES'].sum().reset_index()
northwest_woodsy.tail(1)
Out[263]:
DATE UNIT_SALES
147 2023-10-28 9717.0
In [ ]:
from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly

# Initialize and fit the Prophet model
p_model_3 = Prophet()

northwest_woodsy = northwest_woodsy.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})

p_model_3.fit(northwest_woodsy)

# Create future dates for forecasting
future_3 = p_model_3.make_future_dataframe(periods=52, freq='W')

# Make predictions
forecast_3 = p_model_3.predict(future_3)

# Plot the forecast
plot_plotly(p_model_3, forecast_3)

image.png

In [265]:
# week wise sales for best 13 weeks
northwest_woodsy_forecast_1year = forecast_3[forecast_3['ds'] > '2023-10-28']
earliest_date_4 = northwest_woodsy_forecast_1year['ds'].min()

# Step 2: Calculate week numbers
northwest_woodsy_forecast_1year['WEEK_NUMBER'] = ((northwest_woodsy_forecast_1year['ds'] - earliest_date_4).dt.days // 7) % 52
weekly_forecast_data_northwest_woodsy = northwest_woodsy_forecast_1year.groupby('WEEK_NUMBER')['yhat'].sum().reset_index()
In [266]:
# Ensure that there are enough rows to compute the rolling sum
if len(weekly_forecast_data_northwest_woodsy) < 13:
    print("Insufficient data to compute rolling sum.")
else:
    # Initialize an empty list to store the results
    rolling_sales_sum = []

    # Iterate over the DataFrame to calculate the rolling sum for each 13-week interval
    for i in range(len(weekly_forecast_data_northwest_woodsy) - 12):
        # Calculate the sum of UNIT_SALES for the current 13-week interval
        sales_sum = weekly_forecast_data_northwest_woodsy.loc[i:i+12, 'yhat'].sum()
        # Construct the 13-week interval string
        interval = f"{weekly_forecast_data_northwest_woodsy.iloc[i]['WEEK_NUMBER']}-{weekly_forecast_data_northwest_woodsy.iloc[i+12]['WEEK_NUMBER']}"
        # Append the result to the list
        rolling_sales_sum.append((sales_sum, interval))

    # Create a new DataFrame from the results
    rolling_sales_df = pd.DataFrame(rolling_sales_sum, columns=['SALES_13W_SUM', '13_WEEK_INTERVAL'])
In [267]:
import matplotlib.pyplot as plt

# Assuming rolling_sales_df contains the DataFrame with 'SALES_13W_SUM' and '13_WEEK_INTERVAL' columns

# Extract data for plotting
sales_sum = rolling_sales_df['SALES_13W_SUM']
interval = rolling_sales_df['13_WEEK_INTERVAL']

# Plot the rolling sum
plt.figure(figsize=(18, 6))
plt.plot(interval, sales_sum, marker='o', linestyle='-')
plt.xlabel('13-Week Interval')
plt.ylabel('Sales Sum')
plt.title('Rolling Sum of Sales over 13-Week Intervals')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.grid(True)

# Find the index of the maximum value
max_index = sales_sum.idxmax()
max_value = sales_sum[max_index]
max_interval = interval[max_index]

# Annotate the peak with a red marker and the corresponding 13-week interval
plt.annotate(f'Max: {max_value} ({max_interval})',
             xy=(max_interval, max_value),
             xytext=(70, -20),
             textcoords='offset points',
             arrowprops=dict(facecolor='red', arrowstyle='->'),
             fontsize=10,
             color='red',
             ha='center')

plt.tight_layout()
plt.show()
In [268]:
total_fil1_northwest_woodsy = weekly_forecast_data_northwest_woodsy['yhat'].sum()
total_13_week_sum_northwest_woodsy = rolling_sales_df[rolling_sales_df['13_WEEK_INTERVAL']== '30.0-42.0'].sum()['SALES_13W_SUM']
percentage = total_13_week_sum_northwest_woodsy*100/total_fil1_northwest_woodsy
print(round(percentage,2))
30.98

Best 13 weeks accounts for 30.98% of the 1 year sales. Lets do the same for Central region and South regions

In [269]:
central = merged_data[merged_data['REGION'] == 'Central']
woodsy_yellow_central = central[(central['ITEM'].str.contains('WOODSY  YELLOW ', case=False, regex=True))]
woodsy_yellow_central_regular = woodsy_yellow_central[woodsy_yellow_central['CALORIC_SEGMENT'] == 'REGULAR']
woodsy_yellow_central_regular_enhanced_water = woodsy_yellow_central_regular[woodsy_yellow_central_regular['CATEGORY'] == 'ING ENHANCED WATER']
In [270]:
woodsy_yellow_central_regular_enhanced_water['MANUFACTURER'].value_counts()
Out[270]:
MANUFACTURER
COCOS    11438
Name: count, dtype: int64
In [271]:
central_woodsy = woodsy_yellow_central_regular_enhanced_water.groupby('DATE')['UNIT_SALES'].sum().reset_index()
central_woodsy.tail(1)
Out[271]:
DATE UNIT_SALES
147 2023-10-28 7127.0
In [ ]:
from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly

# Initialize and fit the Prophet model
p_model_4 = Prophet()

central_woodsy = central_woodsy.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})

p_model_4.fit(central_woodsy)

# Create future dates for forecasting
future_5 = p_model_4.make_future_dataframe(periods=52, freq='W')

# Make predictions
forecast_5 = p_model_4.predict(future_5)

# Plot the forecast
plot_plotly(p_model_4, forecast_5)

image.png

In [273]:
# week wise sales for best 13 weeks
central_woodsy_forecast_1year = forecast_5[forecast_5['ds'] > '2023-10-28']
earliest_date_6 = central_woodsy_forecast_1year['ds'].min()

# Step 2: Calculate week numbers
central_woodsy_forecast_1year['WEEK_NUMBER'] = ((central_woodsy_forecast_1year['ds'] - earliest_date_6).dt.days // 7) % 52
weekly_forecast_data_central_woodsy = central_woodsy_forecast_1year.groupby('WEEK_NUMBER')['yhat'].sum().reset_index()
In [274]:
# Ensure that there are enough rows to compute the rolling sum
if len(weekly_forecast_data_central_woodsy) < 13:
    print("Insufficient data to compute rolling sum.")
else:
    # Initialize an empty list to store the results
    rolling_sales_sum = []

    # Iterate over the DataFrame to calculate the rolling sum for each 13-week interval
    for i in range(len(weekly_forecast_data_central_woodsy) - 12):
        # Calculate the sum of UNIT_SALES for the current 13-week interval
        sales_sum = weekly_forecast_data_central_woodsy.loc[i:i+12, 'yhat'].sum()
        # Construct the 13-week interval string
        interval = f"{weekly_forecast_data_central_woodsy.iloc[i]['WEEK_NUMBER']}-{weekly_forecast_data_central_woodsy.iloc[i+12]['WEEK_NUMBER']}"
        # Append the result to the list
        rolling_sales_sum.append((sales_sum, interval))

    # Create a new DataFrame from the results
    rolling_sales_df = pd.DataFrame(rolling_sales_sum, columns=['SALES_13W_SUM', '13_WEEK_INTERVAL'])
In [275]:
import matplotlib.pyplot as plt

# Assuming rolling_sales_df contains the DataFrame with 'SALES_13W_SUM' and '13_WEEK_INTERVAL' columns

# Extract data for plotting
sales_sum = rolling_sales_df['SALES_13W_SUM']
interval = rolling_sales_df['13_WEEK_INTERVAL']

# Plot the rolling sum
plt.figure(figsize=(18, 6))
plt.plot(interval, sales_sum, marker='o', linestyle='-')
plt.xlabel('13-Week Interval')
plt.ylabel('Sales Sum')
plt.title('Rolling Sum of Sales over 13-Week Intervals')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.grid(True)

# Find the index of the maximum value
max_index = sales_sum.idxmax()
max_value = sales_sum[max_index]
max_interval = interval[max_index]

# Annotate the peak with a red marker and the corresponding 13-week interval
plt.annotate(f'Max: {max_value} ({max_interval})',
             xy=(max_interval, max_value),
             xytext=(70, -20),
             textcoords='offset points',
             arrowprops=dict(facecolor='red', arrowstyle='->'),
             fontsize=10,
             color='red',
             ha='center')

plt.tight_layout()
plt.show()
In [276]:
total_fil1_central_woodsy = weekly_forecast_data_central_woodsy['yhat'].sum()
total_13_week_sum_northwest_central = rolling_sales_df[rolling_sales_df['13_WEEK_INTERVAL']== '32.0-44.0'].sum()['SALES_13W_SUM']
percentage = total_13_week_sum_northwest_central*100/total_fil1_central_woodsy
print(round(percentage,2))
31.14

Best 13 weeks accounts for 31.14% of the 1 year sales. Lets do the same for central region now:

In [277]:
import pandas as pd

data = {
    'Region': ['NorthWest', 'Central', 'NorthWest', 'Central'],
    'Filter': ['Brand  + Category', 'Brand  + Category ', 'Flavor + Caloric + Category ', 'Flavor  + Caloric + Category '],
    'Best 13 Weeks (Forecasted)': ['29 - 41', '25 - 37', '30 - 42', '32 - 44'],
    'Absolute Value (Unit Sales)': [300731, 272066, 173096, 108107],
    'Absolute Value Percentage for total weeks': ['30.62%', '31.24%', '30.98%', '31.14%']
}

df = pd.DataFrame(data)

df
Out[277]:
Region Filter Best 13 Weeks (Forecasted) Absolute Value (Unit Sales) Absolute Value Percentage for total weeks
0 NorthWest Brand + Category 29 - 41 300731 30.62%
1 Central Brand + Category 25 - 37 272066 31.24%
2 NorthWest Flavor + Caloric + Category 30 - 42 173096 30.98%
3 Central Flavor + Caloric + Category 32 - 44 108107 31.14%
  • Based on the values of both the filters, NorthWest constituting higher absolute sales. Hence, NW will be the final region.
  • The product could be launched in Northwest region with expected sales for the forecasted year to be around 173k with the best 13 weeks is 30 - 42.

Q7

Item Description: Peppy Gentle Drink Pink Woodsy .5L Multi Jug

  • Caloric Segment: Regular
  • Type: SSD
  • Manufacturer: Swire-CC
  • Brand: Peppy
  • Package Type: .5L Multi Jug
  • Flavor: ‘Pink Woodsy’

Swire plans to release this product in the Southern region for 13 weeks. What will the forecasted demand be, in weeks, for this product?

Data Preparation¶

In [3]:
southern_keys = [
    259, 260, 524, 272, 531, 277, 534, 535, 793, 794, 799, 801, 294, 807, 44,
    557, 310, 825, 316, 831, 323, 585, 849, 862, 866, 867, 358, 365, 622, 367,
    882, 891, 893, 895, 900, 913, 409, 673, 674, 939, 179, 949, 951, 952, 953,
    197, 967, 205, 210, 979, 212, 980, 981, 216, 729, 220, 733, 478, 739, 743,
    746, 748, 238, 754, 759, 766
]

# Filter the DataFrame
southern = market_demand[market_demand['MARKET_KEY'].isin(southern_keys)]
In [4]:
peppy = southern[(southern['BRAND'].str.contains('peppy', case=False, regex=True))]
peppy_regular = peppy[(peppy['CALORIC_SEGMENT'].str.contains('Regular', case=False, regex=True))]
peppy_regular_ssd = peppy_regular[(peppy_regular['CALORIC_SEGMENT'].str.contains('Regular', case=False, regex=True))]
peppy_regular_ssd_multi_jug = peppy_regular_ssd[(peppy_regular_ssd['PACKAGE'].str.contains('Multi Jug', case=False, regex=True))]
In [5]:
df = peppy_regular_ssd_multi_jug[['DATE','UNIT_SALES']].groupby(by="DATE", as_index=False).sum()
df['DATE'] = pd.to_datetime(df['DATE'])
df.set_index('DATE', inplace=True)
df = df.asfreq('W-SAT')
In [6]:
plt.plot(df)
Out[6]:
[<matplotlib.lines.Line2D at 0x290d3368b90>]

There are NAN in some weeks. as we dig deeper we see that there are no sale on the month of august 2023.

In [7]:
# Train and Test periods
from datetime import datetime

s_date = '2023-06-03'
e_date = '2023-08-26'
df = df.loc[df.index <= e_date]
train_period = df.loc[df.index < s_date]
test_period = df.loc[(df.index >= s_date) & (df.index <= e_date)]


start_forecast = datetime.strptime(s_date, '%Y-%m-%d').date()
end_forecast = datetime.strptime(e_date, '%Y-%m-%d').date()

Model Performance Metrics Function¶

We will create and resuse this function to measure each models performance with the data. This will include MAE, MSE, RMSE, MAPE and direct accuracy percentage.

In [8]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt
import numpy as np

def get_scores(actual, predicted):
    # Calculate errors
    mae = mean_absolute_error(actual, predicted)
    mse = mean_squared_error(actual, predicted)
    rmse = sqrt(mse)
    percentage_diff = np.abs((actual - predicted) / actual) * 100

    # Calculate MAPE
    mape = percentage_diff.mean()

    # Calculate "Accuracy" Percentage
    accuracy_percentage = 100 - mape

    # Print metrics
    print(f"MAE: {mae}")
    print(f"MSE: {mse}")
    print(f"RMSE: {rmse}")
    print(f"MAPE: {mape}%")
    print(f"Direct 'Accuracy' Percentage: {accuracy_percentage}%")

    return pd.Series(data={'mae':mae, 'mse':mse, 'rmse':rmse, 'mape':mape, 'direct_accuracy':accuracy_percentage}, index=['mae', 'mse', 'rmse','mape','direct_accuracy'])

ARIMA¶

ARIMA is a statistical model used for time series analysis to forecast future data points by leveraging past data. It combines three main aspects: autoregression (AR), differencing (I) to make the time series stationary, and moving average (MA). The AR part exploits the relationship between an observation and a number of lagged observations, the I part involves differencing the data to achieve stationarity, and the MA part models the error of the observation as a combination of previous error terms.

In [9]:
from statsmodels.tsa.arima.model import ARIMA

# ARIMA Model fitting
order = (2, 1, 2)  # Still (p, d, q)

model = ARIMA(train_period['UNIT_SALES'], order=order)
model_fit = model.fit()

# Forecasting
forecast = model_fit.predict(start=start_forecast,
                             end=end_forecast)

arima_forecast = forecast
In [10]:
arima_score = get_scores(test_period.squeeze(), arima_forecast)
MAE: 10610.921900092897
MSE: 133607735.47205538
RMSE: 11558.881237907732
MAPE: 3.682049809516563%
Direct 'Accuracy' Percentage: 96.31795019048344%
In [11]:
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(test_period['UNIT_SALES'], label='Actual Sales')
plt.plot(arima_forecast, label='Forecast', color='red')
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()

SARIMA¶

SARIMA extends ARIMA by explicitly accommodating and modeling seasonal effects in time series data. It includes additional seasonal elements on top of the AR, I, and MA components. SARIMA is characterized by its ability to model both non-seasonal and seasonal components of the time series data, making it more versatile than ARIMA for data with clear seasonal patterns, such as sales data around specific holidays or events. It incorporates additional parameters to handle seasonality, which are seasonal AR, seasonal differencing, and seasonal MA components, allowing it to capture seasonal fluctuations effectively, making it ideal for products with seasonal demand.

In [26]:
from statsmodels.tsa.statespace.sarimax import SARIMAX


# SARIMA Model fitting
order = (1, 1, 1)  # (p, d, q)
seasonal_order = (1, 1, 1, 52)  # (P, D, Q, s)

model = SARIMAX(train_period['UNIT_SALES'], order=order, seasonal_order=seasonal_order)
model_fit = model.fit()

# Forecasting
forecast = model_fit.predict(start=start_forecast.strftime('%Y-%m-%d'),
                             end=end_forecast.strftime('%Y-%m-%d'))
In [13]:
sarima_score = get_scores(test_period.squeeze(), sarima_forecast)
MAE: 24348.214490805636
MSE: 801042551.4448053
RMSE: 28302.695126874496
MAPE: 8.420642444814801%
Direct 'Accuracy' Percentage: 91.5793575551852%
In [14]:
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(test_period['UNIT_SALES'], label='Actual Sales')
plt.plot(sarima_forecast, label='Forecast', color='red')
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()

Prophet¶

Prophet is a forecasting tool designed by Facebook for handling time series data that displays patterns on different time scales such as yearly, weekly, and daily. It is especially useful for data with strong seasonal effects and several seasons of historical data. Prophet works by fitting nonlinear trends with yearly, weekly, and daily seasonality, plus holiday effects. It is robust to missing data and shifts in the trend, and typically requires no manual tuning of parameters. The model accommodates seasonality through Fourier series and includes components for holidays and special events, making it well-suited for predicting demand for products around specific events or holidays, like Easter.

In [25]:
from prophet import Prophet

df_mod = train_period.reset_index()
df_mod['DATE'] = pd.to_datetime(df_mod['DATE'])
df_mod.columns = ['ds', 'y']

model = Prophet(changepoint_prior_scale=0.1,seasonality_prior_scale=0.5, weekly_seasonality=True)
model.fit(df_mod)

# Forecast the next 52 weeks (1 year)
future = model.make_future_dataframe(periods=53, freq='W-SAT')
forecast = model.predict(future)

prophet_forecast = forecast[['ds','yhat']]
prophet_forecast['ds'] = pd.to_datetime(prophet_forecast['ds'])
prophet_forecast.columns = ['DATE', 'yhat']
prophet_forecast.set_index('DATE', inplace=True)
prophet_forecast = prophet_forecast.asfreq('W-SAT')
prophet_forecast = prophet_forecast.loc[(prophet_forecast.index >= s_date) & (prophet_forecast.index <= e_date)]
In [16]:
prophet_score = get_scores(test_period.squeeze(), prophet_forecast.loc[(prophet_forecast.index >= s_date) & (prophet_forecast.index <= e_date)].squeeze())
MAE: 14082.66495499823
MSE: 257837638.2131654
RMSE: 16057.32350714668
MAPE: 4.88229507297955%
Direct 'Accuracy' Percentage: 95.11770492702045%
In [17]:
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(test_period['UNIT_SALES'], label='Actual Sales')
plt.plot(prophet_forecast, label='Forecast', color='red')
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()

Exponential Smoothing¶

Exponential Smoothing is a time series forecasting method for univariate data that applies smoothing factors to the observations, giving more weight to recent observations while not discarding older observations entirely. It encompasses simple exponential smoothing for data with no clear trend or seasonality, and extends to Holt’s linear trend model and Holt-Winters’ seasonal model, which can account for both trends and seasonality in the data. This method is straightforward and computationally efficient, making it a good choice for producing quick forecasts in situations where data patterns are reasonably consistent over time, but may struggle with data that has complex patterns or significant irregularities.

In [24]:
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# ExponentialSmoothing Model fitting
model = ExponentialSmoothing(train_period['UNIT_SALES'], trend='add', seasonal='add', seasonal_periods=52).fit(smoothing_level=0.6, smoothing_slope=0.2, smoothing_seasonal=0.2)

# Forecasting

forecast_periods = ((end_forecast - start_forecast).days // 7) + 1

exponential_forecast = model.forecast(forecast_periods)

forecast_dates = pd.date_range(start=start_forecast, periods=forecast_periods, freq='W-SAT')
In [19]:
exponential_smoothing_score = get_scores(test_period.squeeze(), exponential_forecast)
MAE: 23907.5251431621
MSE: 744489010.3472316
RMSE: 27285.325916089616
MAPE: 8.281600968319658%
Direct 'Accuracy' Percentage: 91.71839903168035%
In [20]:
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(test_period['UNIT_SALES'], label='Actual Sales')
plt.plot(exponential_forecast, label='Forecast', color='red')
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()

Performance Summary¶

In [21]:
pd.options.display.float_format = '{:.2f}'.format
q2_scores = pd.DataFrame({'Arima':arima_score, 'Sarima':sarima_score, 'Prophet':prophet_score, 'Exponential Smoothing':exponential_smoothing_score}).T
print(q2_scores)
                           mae          mse     rmse  mape  direct_accuracy
Arima                 10610.92 133607735.47 11558.88  3.68            96.32
Sarima                24348.21 801042551.44 28302.70  8.42            91.58
Prophet               14082.66 257837638.21 16057.32  4.88            95.12
Exponential Smoothing 23907.53 744489010.35 27285.33  8.28            91.72
In [22]:
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(18, 10))
fig.suptitle('Model Performance Metrics')

for i, column in enumerate(q2_scores.columns):
    row, col = divmod(i, 3)
    sns.barplot(ax=axes[row, col], x=q2_scores.index, y=q2_scores[column])
    axes[row, col].set_title(column)
    axes[row, col].set_ylabel('')  # Removing the y-label for cleanliness

# Adjust layout
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

Based on the updated metrics, the Prophet model appears to be the most suitable choice for this scenario, especially when considering the graph resemblance to actual values as a crucial criterion. Although its MAE, MSE, and RMSE values are not the lowest among the compared models, they indicate a reasonable accuracy level in predictions close to the actual values. The Prophet model's MAPE is relatively low, suggesting that its predictions' percentage errors are acceptable, contributing to its overall forecast reliability. Furthermore, its Direct Accuracy is impressively high, only slightly below the highest, showcasing its effectiveness in closely matching the actual trends. Given the importance of graph resemblance to actuals for this analysis, the Prophet model, with its balance between statistical accuracy and visual trend matching, is deemed the most appropriate choice.

Predicting forecast for 13 weeks using Prophet¶

In [23]:
df = peppy_regular_ssd_multi_jug[['DATE','UNIT_SALES']].groupby(by="DATE", as_index=False).sum()
df['DATE'] = pd.to_datetime(df['DATE'])
df.set_index('DATE', inplace=True)
df = df.loc[df.index <= "2023-08-26"]
df = df.asfreq('W-SAT')
In [24]:
from prophet import Prophet

df_mod = df.reset_index()
df_mod['DATE'] = pd.to_datetime(df_mod['DATE'])
df_mod.columns = ['ds', 'y']

model = Prophet(changepoint_prior_scale=0.1,seasonality_prior_scale=0.5, weekly_seasonality=True)
model.fit(df_mod)

future = model.make_future_dataframe(periods=13, freq='W-SAT')
forecast = model.predict(future)

prophet_forecast = forecast[['ds','yhat','yhat_upper','yhat_lower']]
prophet_forecast['ds'] = pd.to_datetime(prophet_forecast['ds'])
prophet_forecast.columns = ['DATE', 'yhat','yhat_upper','yhat_lower']
prophet_forecast.set_index('DATE', inplace=True)
prophet_forecast = prophet_forecast.asfreq('W-SAT')
prophet_forecast = prophet_forecast.loc[(prophet_forecast.index >= "2023-09-02")]
20:33:04 - cmdstanpy - INFO - Chain [1] start processing
20:33:04 - cmdstanpy - INFO - Chain [1] done processing
C:\Users\Michael Mendoza\AppData\Local\Temp\ipykernel_19324\963624670.py:14: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  prophet_forecast['ds'] = pd.to_datetime(prophet_forecast['ds'])
In [25]:
prophet_forecast
Out[25]:
yhat yhat_upper yhat_lower
DATE
2023-09-02 285476.40 298778.29 271838.69
2023-09-09 288354.50 300856.25 276032.32
2023-09-16 289057.31 302472.57 274802.73
2023-09-23 287248.53 300958.92 273797.52
2023-09-30 285106.30 298506.32 271294.70
2023-10-07 284271.78 298363.42 271306.87
2023-10-14 283922.05 297361.37 270622.10
2023-10-21 282352.50 296781.29 268788.65
2023-10-28 279677.03 294189.70 266454.82
2023-11-04 277865.93 290756.87 264269.66
2023-11-11 278087.04 290981.57 264992.78
2023-11-18 279119.81 293013.24 265333.38
2023-11-25 279277.11 293360.46 265840.58
In [26]:
plt.figure(figsize=(10, 6))
plt.plot(prophet_forecast.index, prophet_forecast['yhat'], label='Predicted', color='blue', marker='o')
plt.fill_between(prophet_forecast.index, prophet_forecast['yhat_lower'], prophet_forecast['yhat_upper'], color='gray', alpha=0.2, label='Confidence Interval')
plt.title('Forecast with Confidence Interval')
plt.xlabel('Date')
plt.ylabel('Forecast Value')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

Summary¶

The forecasting analysis for Swire's upcoming beverage, the Peppy Gentle Drink in "Pink Woodsy" flavor, packaged in a .5L Multi Jug, focuses on the product's demand projection for a 13-week span around the Southern region's Easter 2024 period. The data was meticulously curated to match the product's attribute,, caloric segment, type, manufacturer, and packaging, from a comprehensive market demand dataset. This strategic approach allowed for an accurate representation and analysis relevant to the product's market introduction.

Seveted forecasting models were employed, including ARIMA, SARIMA, Prophet, and Exponential Smoothing, each evaluated against critical performance metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and Direct 'Accuracy' Percentage. These metrics provided a comprehensive assessment of each model's accuracy and reliability in predicting future sales demand.

The Prophet model emerged as the most fitting choice for this specific forecasting task. While its MAE, MSE, and RMSE values did not register as the absolute lowest among the models evaluated, they were nonetheless indicative of a commendably high level of prediction accuracy. The Prophet model's relatively low MAPE and high Direct Accuracy Percentage further affirmed its efficacy in closely mirroring actual sales trends, a crucial factor for this analysis.

The demand forecast through the Prophet model reveals an anticipated fluctuation in sales over the target period, with projections highlighting key trends and potential peak demand times around Easter 2024. This predictive insight is invaluable for Swire in strategizing production, inventory, and marketing efforts to optimally meet anticipated consumer demand. By effectively leveraging the Prophet model's forecast, Swire can ensure product availability aligns with market demand, maximizing sales potential and customer satisfaction during this critical launch

Team Contribution

Michael: Q2, Q6 and Q7 Modelling with interpretations, Model evaluation metrics, ARIMA & SARIMA

Rawali: Q3 Analysis with Interpretations, Model Validation & evaluation

Neil: Q4 Analysis with Interpretations, Q5 Modelling, Notebook final edits

Abinav: Q1 Modelling with Interpretations, Q5 Visualizations, Prophet model, Writing part for Introduction, Data Preparation, Approach & assumptions.

Go to TOC